RLDX-1 Technical Report
Pith reviewed 2026-05-08 18:56 UTC · model grok-4.3
The pith
RLDX-1 uses a multi-stream transformer to reach 86.8% success on ALLEX humanoid tasks while other VLAs stay near 40%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLDX-1 is a robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer that integrates heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. Combined with data synthesis for rare scenarios, specialized learning procedures for human-like manipulation, and inference optimizations, the policy achieves higher success rates than frontier VLAs across simulation and real-world settings. On ALLEX humanoid tasks it reaches 86.8% success while π0.5 and GR00T N1.6 reach around 40%, showing effective control of high-DoF robots under diverse functional demands.
What carries the argument
The Multi-Stream Action Transformer (MSAT), which unifies modalities by running each in its own stream and then applying joint self-attention across streams to combine scene understanding with action generation.
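As a concrete (and deliberately simplified) illustration of that pattern, here is a minimal PyTorch sketch: per-modality encoder streams followed by one joint self-attention step over the concatenated tokens. Every name, dimension, and layer choice is an assumption for illustration; the abstract does not specify MSAT's actual internals.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """One block in the multi-stream pattern: refine each modality in its own
    stream, then mix all tokens with cross-modal joint self-attention.
    Hypothetical sketch; not the paper's reported configuration."""

    def __init__(self, dim: int = 256, heads: int = 8, num_streams: int = 3):
        super().__init__()
        # One encoder layer per modality (e.g., vision, language, action/proprio).
        self.streams = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(num_streams)
        )
        # Joint self-attention over the concatenated token sequence.
        self.joint = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, streams: list[torch.Tensor]) -> list[torch.Tensor]:
        refined = [layer(x) for layer, x in zip(self.streams, streams)]
        mixed = self.joint(torch.cat(refined, dim=1))  # (B, sum_i T_i, dim)
        # Split back so downstream blocks keep per-modality structure.
        return list(mixed.split([x.shape[1] for x in refined], dim=1))

# Toy usage: batch of 2, three token streams of different lengths.
block = MultiStreamBlock()
vision = torch.randn(2, 64, 256)
language = torch.randn(2, 16, 256)
action = torch.randn(2, 32, 256)
vision, language, action = block([vision, language, action])
```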
If this is right
- RLDX-1 can control high-DoF humanoid robots reliably across contact-rich and dynamic tasks.
- Data synthesis for rare manipulation scenarios improves coverage of edge cases that general pre-training misses.
- Specialized learning procedures and real-time inference optimizations make the policy practical for physical robot deployment.
- Vision-language-action models can be extended to support broader functional capabilities without losing general versatility.
Where Pith is reading between the lines
- A multi-stream design might reduce reliance on ever-larger unified pre-training by letting each modality develop its own representations first.
- The same architecture could be tested on non-humanoid platforms to check whether the performance lift depends on the specific robot morphology.
- Future comparisons could isolate whether cross-modal joint self-attention or the separate streams contribute most to the observed gains.
Load-bearing premise
The performance differences arise from the MSAT architecture and listed system-level choices rather than uncontrolled differences in training data volume, evaluation protocols, robot hardware calibration, or task definitions.
What would settle it
A side-by-side retraining and evaluation of RLDX-1, π0.5, and GR00T N1.6 on identical data volumes, identical task definitions, and identical hardware calibration to test whether the success-rate gap remains.
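A minimal sketch of such a matched protocol, assuming hypothetical policy and task objects (nothing here comes from the paper): each policy sees the same tasks, the same number of trials, and the same per-episode seeds, so any remaining gap cannot be blamed on the evaluation side.

```python
import random

def evaluate(policy, tasks, trials_per_task: int, seed: int) -> dict:
    """Success rate per task under a fixed, shared protocol.

    `policy` and the `task.run(policy, seed=...) -> bool` interface are
    hypothetical placeholders for whatever harness is actually used.
    """
    rng = random.Random(seed)  # same seed => same episode seeds for every policy
    rates = {}
    for task in tasks:
        episode_seeds = [rng.randrange(2**31) for _ in range(trials_per_task)]
        successes = sum(task.run(policy, seed=s) for s in episode_seeds)
        rates[task.name] = successes / trials_per_task
    return rates

# Identical data volumes and training would be handled upstream; here only the
# evaluation protocol is pinned down:
# for name, policy in [("RLDX-1", rldx1), ("pi-0.5", pi05), ("GR00T N1.6", groot)]:
#     print(name, evaluate(policy, tasks, trials_per_task=50, seed=0))
```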
Original abstract
While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g. motion awareness, long-term memory, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including data synthesis for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. $\pi_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving success rates of 86.8% while $\pi_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT) architecture, which integrates heterogeneous modalities via modality-specific streams and cross-modal attention. It further incorporates system-level choices including data synthesis for rare scenarios, specialized learning procedures, and inference optimizations. The central claim is that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π0.5 and GR00T N1.6) on simulation benchmarks and real-world tasks, with a highlighted result of 86.8% success rate on ALLEX humanoid tasks versus approximately 40% for the baselines.
Significance. If the performance differences can be rigorously attributed to the MSAT architecture and listed design choices rather than confounds, this would constitute a notable contribution to vision-language-action models for complex, high-DoF humanoid manipulation requiring functional capabilities beyond general versatility. The architecture's unification of modalities addresses a recognized limitation in current VLAs, but the absence of supporting evidence in what is presented prevents assessing whether the result holds.
Major comments (3)
- Abstract: The claim of consistent outperformance, including the specific 86.8% versus ~40% success rates on ALLEX humanoid tasks, is presented without any description of the experimental setup, baseline implementations, training data volumes or compositions, task definitions, success criteria, or hardware calibration details. This prevents determining whether the gap is attributable to MSAT rather than uncontrolled differences in training regime or evaluation protocols.
- Abstract: No ablation studies, controlled experiments, or comparisons isolating the MSAT architecture from the system-level choices (data synthesis, specialized learning procedures, inference optimizations) are reported, leaving the central performance attribution unsupported.
- Abstract: The manuscript supplies no information on the number of evaluation trials, error bars, variance, or statistical tests for the reported success rates, which is required to substantiate claims of consistent superiority across benchmarks and real-world tasks.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional clarity and evidence are needed to support our claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: Abstract: The claim of consistent outperformance, including the specific 86.8% versus ~40% success rates on ALLEX humanoid tasks, is presented without any description of the experimental setup, baseline implementations, training data volumes or compositions, task definitions, success criteria, or hardware calibration details. This prevents determining whether the gap is attributable to MSAT rather than uncontrolled differences in training regime or evaluation protocols.
Authors: We agree that the abstract is too concise and omits critical context on the experimental setup, baselines, data composition, task definitions, success criteria, and hardware details. We will revise the abstract to include a brief high-level summary of the evaluation protocol and benchmarks. We will also expand the main text to provide complete descriptions of training data volumes, baseline implementations, task specifications, success criteria, and hardware calibration, enabling readers to assess potential confounds and attribute performance differences more clearly. Revision: yes.
Referee: Abstract: No ablation studies, controlled experiments, or comparisons isolating the MSAT architecture from the system-level choices (data synthesis, specialized learning procedures, inference optimizations) are reported, leaving the central performance attribution unsupported.
Authors: The referee correctly notes the absence of ablations or controlled experiments that isolate the MSAT architecture from the system-level choices. While the full RLDX-1 system shows strong results against frontier VLAs, this does not rigorously separate the architectural contribution. We will add a dedicated ablation subsection in the Experiments section, including controlled comparisons (e.g., MSAT versus a standard transformer backbone with other components held fixed) to better support the attribution of gains to the multi-stream design. Revision: yes.
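One way to organize such an ablation is a one-factor-at-a-time grid over the system's components. The sketch below is purely illustrative; the config fields and names are assumptions, not the authors' actual training setup.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SystemConfig:
    backbone: str               # "msat" or "standard" transformer
    data_synthesis: bool        # rare-scenario data synthesis on/off
    specialized_learning: bool  # human-like manipulation procedures on/off
    inference_opts: bool        # real-time inference optimizations on/off

full = SystemConfig("msat", True, True, True)

# Flip exactly one component per run, holding everything else fixed, so each
# success-rate delta is attributable to a single design choice.
ablations = {
    "architecture": replace(full, backbone="standard"),
    "data_synthesis": replace(full, data_synthesis=False),
    "learning_procedure": replace(full, specialized_learning=False),
    "inference_opts": replace(full, inference_opts=False),
}
```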
Referee: Abstract: The manuscript supplies no information on the number of evaluation trials, error bars, variance, or statistical tests for the reported success rates, which is required to substantiate claims of consistent superiority across benchmarks and real-world tasks.
Authors: We acknowledge that the reported success rates are presented without details on the number of trials, error bars, variance, or statistical tests. In the revised manuscript, we will update all results tables and figures to specify the number of evaluation trials per task, include standard deviations or confidence intervals, and report appropriate statistical tests (such as paired t-tests) to substantiate the consistency of the observed outperformance. Revision: yes.
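For reference, one standard way to attach uncertainty to a binomial success rate is a Wilson score interval. The sketch below uses a hypothetical trial count (53, chosen only so that 46/53 ≈ 86.8%); per-task intervals like this could then feed the paired tests the authors mention.

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

print(wilson_ci(46, 53))  # ~ (0.75, 0.93): still a wide interval at this trial count
```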
Circularity Check
No circularity: empirical performance comparison with no derivations or self-defined reductions
Rationale
The paper is an empirical technical report introducing the RLDX-1 policy and MSAT architecture, then reporting success rates on benchmarks and real-world tasks (e.g., 86.8% on ALLEX vs. ~40% for baselines). No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described content. Claims rest on direct experimental outcomes rather than any reduction to self-citations, ansatzes, or renamed inputs. The central attribution to architecture and system choices is presented as an empirical finding, not a mathematical necessity, making the work self-contained against external benchmarks with no load-bearing circular steps.
Reference graph
Works this paper leans on
- [1] Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini Robotics: Bringing AI into the Physical World. arXiv preprint arXiv:2503.20020, 2025.
- [2] Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al. π0.7: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities. arXiv preprint arXiv:2604.15483, 2026.
- [3] Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World Simulation with Video Foundation Models for Physical AI. arXiv preprint arXiv:2511.00062, 2025.
- [4] Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π*0.6: A VLA That Learns From Experience. arXiv preprint arXiv:2511.14759, 2025.
- [5] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, et al. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024.
- [6] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv preprint arXiv:2506.09985, 2025.
- [7] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... arXiv, 2025.
- [8] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. In Advances in Neural Information Processing Systems, 2022.
- [9] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A Versatile 3B VLM for Transfer. arXiv preprint arXiv:2407.07726, 2024.
- [10] Raunaq Bhirangi, Venkatesh Pattabiraman, Enes Erciyes, Yifeng Cao, Tess Hellebrekers, and Lerrel Pinto. AnySkin: Plug-and-Play Skin Sensing for Robotic Touch. In IEEE International Conference on Robotics and Automation, 2025.
- [11] Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, and Harold Soh. VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback. arXiv preprint arXiv:2507.17294, 2025.
- [12] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv preprint arXiv:2503.14734, 2025.
- [13] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054, 2025.
- [14] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. In Robotics: Science and Systems, 2025.
- [15] Kevin Black, Manuel Y. Galliker, and Sergey Levine. Real-Time Execution of Action Chunking Flow Policies. In Advances in Neural Information Processing Systems, 2025.
- [16] Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025.
- [17] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. In Robotics: Science and Systems, 2023.
- [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020.
- [19] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025.
- [20] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to Act Anywhere with Task-Centric Latent Actions. In Robotics: Science and Systems, 2025.
- [21] Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv preprint arXiv:2410.06158, 2024.
- [22] Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models. arXiv preprint arXiv:2504.15271, 2025.
- [23] Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling. In International Conference on Learning Representations, 2023.
- [24] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training Strategies for Efficient Embodied Reasoning. In Conference on Robot Learning, 2025.
- [25] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models. arXiv preprint arXiv:2507.23682, 2025.
- [26] Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, and Hengdi Zhang. OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing. arXiv preprint arXiv:2508.08706, 2025.
- [27] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Robotics: Science and Systems, 2023.
- [28] StarVLA Community. StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing. arXiv preprint arXiv:2604.05014, 2026.
- [29] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An Embodied Multimodal Language Model. In International Conference on Machine Learning, 2023.
- [30] Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better. In Advances in Neural Information Processing Systems, 2025.
- [31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. In International Conference on Machine Learning, 2024.
- [32] Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. SAM2Act: Integrating Visual Foundation Model with a Memory Architecture for Robotic Manipulation. arXiv preprint arXiv:2501.18564, 2025.
- [33] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. arXiv preprint arXiv:2510.13626, 2025.
- [34] Yao Mu, Fourier ActionNet Team. Fourier ActionNet Dataset. https://action-net.org, 2025.
- [35] NVIDIA GEAR. GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots. https://research.nvidia.com/labs/gear/gr00t-n1_5/, June 2025.
- [36] NVIDIA GEAR. GR00T N1.6: An Improved Open Foundation Model for Generalist Humanoid Robots. https://research.nvidia.com/labs/gear/gr00t-n1_6/, December 2025.
- [37] Gemma Team. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786, 2025.
- [38] Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An Open-Source Generalist Robot Policy. In Robotics: Science and Systems, 2024.
- [39] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. GigaWorld-0: World Models as Data Engine to Empower Embodied AI. arXiv preprint arXiv:2511.19861, 2025.
- [40] Google. Gemini API | Google AI for Developers. https://ai.google.dev/api, 2026. Accessed 03-05-2026.
- [41] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations, 2022.
- [42] Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, and Pieter Abbeel. Otter: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction. arXiv preprint arXiv:2503.03734, 2025.
- [43] Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, and Yang Gao. Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization. arXiv preprint arXiv:2507.09160, 2025.
- [44] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. In Conference on Robot Learning, 2023.
- [45] Huiwon Jang, Sihyun Yu, Heeseung Kwon, Hojin Jeon, Younggyo Seo, and Jinwoo Shin. ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context. arXiv preprint arXiv:2510.04246, 2025.
- [46] Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. In Conference on Robot Learning, 2025.
- [47] Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, and Jinwoo Shin. SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning. arXiv preprint arXiv:2603.22057, 2026.
- [48] Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea Open-World Dataset and G0 Dual-System VLA Model. arXiv preprint arXiv:2509.00576, 2025.
- [49] Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning. In IEEE International Conference on Robotics and Automation, 2025.
- [50] Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, and Mike Rabbat. Interpreting Physics in Video World Models. arXiv preprint arXiv:2602.07050, 2026.
- [51] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 1998.
- [52] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset. In Robotics: Science and Systems, 2024.
- [53] Changyeon Kim, Haeone Lee, Younggyo Seo, Kimin Lee, and Yuke Zhu. DEAS: Detached Value Learning with Action Sequence for Scalable Offline RL. In International Conference on Learning Representations, 2026.
- [54] Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, and Younggyo Seo. Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics. In Advances in Neural Information Processing Systems, 2025.
- [55] Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, and Younggyo Seo. RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models. arXiv preprint arXiv:2603.21341, 2026.
- [56] Manjin Kim, Heeseung Kwon, Karteek Alahari, and Minsu Cho. Exploring High-Order Self-Similarity for Video Understanding. arXiv preprint arXiv:2604.20760, 2026.
- [57] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An Open-Source Vision-Language-Action Model. In Conference on Robot Learning, 2024.
- [58] Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. arXiv preprint arXiv:2601.16163, 2026.
- [59] Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, and Jinwoo Shin. RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning. arXiv preprint arXiv:2602.18742, 2026.
- [60] Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. Contrastive Representation Regularization for Vision-Language-Action Models. In International Conference on Machine Learning, 2026.
- [61] Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. Hamlet: Switch Your Vision-Language-Action Model into a History-Aware Policy. In International Conference on Learning Representations, 2026.
- [62] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations, 2022.
- [63] Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition. In IEEE International Conference on Computer Vision, 2021.
- [64] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.
- [65] Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement Learning with Augmented Data. In Advances in Neural Information Processing Systems, 2020.
- [66] Jimin Lee, Huiwon Jang, Myungkyu Koo, Jungwoo Park, and Jinwoo Shin. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models. arXiv preprint arXiv:2604.23272, 2026.
- [67] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al. BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation. In Conference on Robot Learning, 2023.
- [68] Hao Li, Shuai Yang, Yilun Chen, Yang Tian, Xiaoda Yang, Xinyi Chen, Hanqing Wang, Tai Wang, Feng Zhao, Dahua Lin, et al. CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation. arXiv e-prints, 2025.
- [69] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. arXiv preprint arXiv:2411.19650, 2024.
- [70] Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement Learning with Action Chunking. In Advances in Neural Information Processing Systems, 2025.
- [71] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating Real-World Robot Manipulation Policies in Simulation. In Conference on Robot Learning, 2024.
- [72] Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, et al. RoboMeter: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons. arXiv preprint arXiv:2603.02115, 2026.
- [73] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as Policies: Language Model Programs for Embodied Control. In IEEE International Conference on Robotics and Automation, 2023.
- [74] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In International Conference on Learning Representations, 2023.
- [75] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437, 2024.
- [76] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. In Advances in Neural Information Processing Systems, 2023.
- [77] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
- [78] Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations. In Conference on Robot Learning, 2023.
- [79] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance. In Conference on Robot Learning, 2024.
- [80] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. In Robotics: Science and Systems, 2024.