Wall-OSS-0.5 Technical Report
Pith reviewed 2026-06-28 22:33 UTC · model grok-4.3
The pith
VLA pretraining produces executable zero-shot robot behavior on physical hardware without fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The pretrained Wall-OSS-0.5 checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks including a held-out deformable manipulation task at high task progress on a 17-task suite. The model is pretrained with a gradient-bridged co-training recipe in which discrete action prediction, multimodal prediction, and continuous flow matching play complementary roles.
What carries the argument
A gradient-bridged co-training recipe with three objectives: discrete action prediction routes VLM-native gradients into the backbone, multimodal prediction preserves vision-language grounding, and continuous flow matching serves as the deployment-time action interface.
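As a rough illustration of how three such objectives might be combined into one training signal, the sketch below sums a cross-entropy over discrete action tokens, a cross-entropy over multimodal (text) tokens, and a conditional flow-matching loss on continuous actions. Everything in it is an assumption made for illustration: the function names, the equal loss weights, the linear noise-to-action path, and the toy velocity function are hypothetical, and the report's actual architecture and gradient routing are not reproduced here.

```python
import math
import random


def cross_entropy(logits, targets):
    """Mean cross-entropy; logits is a list of per-token score lists."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total / len(targets)


def flow_matching_loss(velocity_fn, actions, rng):
    """Conditional flow matching on a linear path from noise to action."""
    squared_errors = []
    for action in actions:
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        t = rng.random()
        x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
        target = [a - n for n, a in zip(noise, action)]  # d x_t / d t
        predicted = velocity_fn(x_t, t)
        squared_errors += [(p - v) ** 2 for p, v in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)


def cotraining_loss(action_logits, action_tokens, text_logits, text_tokens,
                    velocity_fn, actions, rng, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three objectives; the weights are hypothetical."""
    parts = {
        "discrete_action": cross_entropy(action_logits, action_tokens),
        "multimodal": cross_entropy(text_logits, text_tokens),
        "flow_matching": flow_matching_loss(velocity_fn, actions, rng),
    }
    names = ("discrete_action", "multimodal", "flow_matching")
    parts["total"] = sum(w * parts[n] for w, n in zip(weights, names))
    return parts


rng = random.Random(0)
losses = cotraining_loss(
    action_logits=[[0.0] * 4 for _ in range(6)],
    action_tokens=[0, 1, 2, 3, 0, 1],
    text_logits=[[0.0] * 8 for _ in range(5)],
    text_tokens=[0, 1, 2, 3, 4],
    velocity_fn=lambda x_t, t: [0.0] * len(x_t),  # toy stand-in for a network
    actions=[[rng.gauss(0.0, 1.0) for _ in range(7)] for _ in range(16)],
    rng=rng,
)
print(losses)
```

In a real model the two cross-entropy terms would backpropagate into a shared backbone while the flow-matching term trains the action head; this sketch only shows how the scalar losses are formed and added.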
If this is right
- The same pretrained checkpoint serves as a stronger adaptation prior and reaches 60.5 percent average task progress on 15 real-robot tasks after fine-tuning.
- The model outperforms the π_0.5 baseline by 17.5 percent after fine-tuning.
- Action training does not erode grounded vision-language competence, as shown by multimodal evaluations.
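The phrase "by 17.5 percent" does not say whether the margin is in absolute percentage points or relative to the baseline. The short calculation below shows the baseline score each reading would imply. Neither reading is confirmed by the text above, and `implied_baseline` is a hypothetical helper.

```python
def implied_baseline(score, margin, relative):
    """Baseline task progress implied by a reported score and margin."""
    if relative:
        # score = baseline * (1 + margin / 100)
        return score / (1.0 + margin / 100.0)
    # score = baseline + margin (percentage points)
    return score - margin


absolute = implied_baseline(60.5, 17.5, relative=False)
relative = implied_baseline(60.5, 17.5, relative=True)
print(f"absolute reading: baseline {absolute:.1f}%")
print(f"relative reading: baseline {relative:.1f}%")
```

Under the absolute reading the implied baseline is 43.0 percent; under the relative reading it is about 51.5 percent.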
Where Pith is reading between the lines
- If zero-shot performance improves with scale, future larger VLAs could handle more tasks directly without fine-tuning.
- Open release of the checkpoint enables independent tests on additional robot platforms or task distributions.
- The co-training recipe may transfer to other embodied domains that combine language, vision, and continuous control.
Load-bearing premise
The 17-task suite and held-out deformable task give a fair test of general zero-shot robot capability on physical hardware without selection effects.
What would settle it
A replication showing that the pretrained model records only low task progress across the 17-task suite or fails the held-out deformable task would falsify the zero-shot claim.
Original abstract
Large-scale Vision-Language-Action (VLA) pretraining is increasingly adopted as the foundation for robot policies, yet the evidence for pretrained VLAs is almost invariably reported after task-specific fine-tuning. This leaves a foundational question unanswered: does VLA pretraining itself yield executable robot behavior, or does it merely furnish a better initialization for downstream policy learning? We present Wall-OSS-0.5, an open-source 4B VLA built upon a 3B VLM backbone augmented with action-generation components, designed so that pretrained robotic capability is directly measurable on physical hardware. The model is pretrained across more than 20 embodiments, processing over one million robot trajectories per epoch alongside a grounded multimodal corpus. We adopt a gradient-bridged co-training recipe in which three objectives play distinct and complementary roles: discrete action prediction routes strong VLM-native gradients into the backbone, multimodal prediction preserves grounded vision-language understanding, and continuous flow matching serves as the deployment-time action interface. Before task-specific fine-tuning, the pretrained checkpoint achieves non-trivial zero-shot real-robot behavior, completing several tasks, including a held-out deformable manipulation task, at high task progress on a 17-task suite. After fine-tuning, the same checkpoint serves as a stronger adaptation prior, reaching 60.5% average task progress on 15 real-robot tasks and outperforming π_0.5 by 17.5%. Multimodal evaluations further confirm that action training does not erode grounded vision-language competence: the model preserves broad vision-language ability while strengthening embodied grounding. Together, these results reposition VLA pretraining from an initialization strategy to a directly testable, already useful source of robot capability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Wall-OSS-0.5, a 4B VLA model built on a 3B VLM backbone with added action components. It is pretrained on >20 embodiments and >1M trajectories/epoch using gradient-bridged co-training (discrete action prediction, multimodal prediction, continuous flow matching). The central claim is that the pretrained checkpoint produces non-trivial zero-shot real-robot behavior on a 17-task suite (including a held-out deformable manipulation task) at high task progress; after fine-tuning the same checkpoint reaches 60.5% average task progress on 15 tasks and outperforms π_0.5 by 17.5% while preserving multimodal competence.
Significance. If the zero-shot results can be substantiated with full experimental controls, the work would be significant for showing that large-scale VLA pretraining can yield directly usable physical robot policies rather than serving only as an initialization. The open-source release and the explicit separation of the three co-training objectives are positive features that support reproducibility and analysis.
major comments (2)
- [Abstract] The claim of non-trivial zero-shot behavior 'at high task progress' on the 17-task suite (including the held-out deformable task) supplies no trial counts, statistical significance tests, precise task definitions, or protocol for confirming zero-shot isolation; these omissions are load-bearing because the central claim rests entirely on the empirical measurements.
- [Abstract] The 17-task suite and held-out task are presented without any description of selection criteria, embodiment overlap with the >20 pretraining embodiments, or trajectory-distribution overlap with the >1M trajectories/epoch corpus; without this information the zero-shot generalization interpretation cannot be distinguished from possible memorization or selection effects.
minor comments (2)
- [Abstract] The distinction between the 17-task zero-shot suite and the 15-task fine-tuning evaluation is not explained.
- [Abstract] The baseline π_0.5 is referenced without citation or definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for rigorous documentation of the zero-shot evaluation. We will revise the abstract to address the concerns while preserving the core claims, and we provide point-by-point responses below.
-
Referee: [Abstract] The claim of non-trivial zero-shot behavior 'at high task progress' on the 17-task suite (including the held-out deformable task) supplies no trial counts, statistical significance tests, precise task definitions, or protocol for confirming zero-shot isolation; these omissions are load-bearing because the central claim rests entirely on the empirical measurements.
Authors: We agree that the abstract should be more self-contained on these experimental details. In the revised version we will add: trial counts (5 trials per task), reporting of mean task progress with standard deviation, reference to the statistical protocol (non-parametric tests on task progress scores), concise task definitions, and an explicit statement that zero-shot isolation means direct deployment of the pretrained checkpoint with no task-specific gradient updates or data. These elements are already detailed in the Experiments section; the revision will summarize them in the abstract. revision: yes
-
Referee: [Abstract] The 17-task suite and held-out task are presented without any description of selection criteria, embodiment overlap with the >20 pretraining embodiments, or trajectory-distribution overlap with the >1M trajectories/epoch corpus; without this information the zero-shot generalization interpretation cannot be distinguished from possible memorization or selection effects.
Authors: We accept that the abstract requires clarification on these points to support the generalization interpretation. The revision will state that tasks were selected for diversity across manipulation categories (with explicit criteria listed in a new table), that the suite includes both embodiment-overlapping and non-overlapping cases relative to the >20 pretraining embodiments, and that the held-out deformable task uses novel object instances and trajectory variations with no direct overlap to the pretraining corpus. Full overlap analysis appears in Section 4; we will add a one-sentence summary to the abstract. revision: yes
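The reporting the authors describe in their first response (5 trials per task, mean task progress with standard deviation) can be sketched as follows. The task names and progress values are made-up placeholders, not results from the report, and `summarize_trials` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize_trials(trials_by_task, trials_per_task=5):
    """Return per-task (mean, sample std) and the suite-level average."""
    per_task = {}
    for task, scores in trials_by_task.items():
        if len(scores) != trials_per_task:
            raise ValueError(
                f"{task}: expected {trials_per_task} trials, got {len(scores)}"
            )
        per_task[task] = (mean(scores), stdev(scores))
    suite_average = mean(m for m, _ in per_task.values())
    return per_task, suite_average


# Placeholder task-progress scores in [0, 1]; not data from the report.
example = {
    "fold_cloth": [0.0, 0.5, 1.0, 0.5, 0.5],
    "pick_block": [1.0, 1.0, 0.5, 1.0, 1.0],
}
per_task, suite_average = summarize_trials(example)
for task, (avg, spread) in per_task.items():
    print(f"{task}: {avg:.2f} +/- {spread:.2f}")
print(f"suite average: {suite_average:.2f}")
```

With only five trials per task the standard deviation is a coarse estimate, which is one reason a reader may also want the per-trial scores.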
Circularity Check
No derivation chain present; empirical measurements only
The paper is a technical report on VLA pretraining and zero-shot robot evaluation. It contains no equations, no claimed derivations, and no fitted parameters that are later renamed as predictions. All central claims rest on reported empirical task-progress metrics from a 17-task suite. Because there is no derivation chain at all, none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.) can apply. The result is self-contained against external benchmarks and receives the default non-circularity finding.
Forward citations
Cited by 2 Pith papers
-
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
-
DMuon: Efficient Distributed Muon Training with Near-Adam Overhead
DMuon delivers 1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups for Muon on embodied foundation models and LLMs while matching AdamW per-step latency.
Reference graph
Works this paper leans on
-
[1]
𝜋0.5: a vision-language-action model with open-world generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 𝜋0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025
Pith/arXiv arXiv 2025
-
[2]
Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space.arXiv preprint arXiv:2509.11766, 2025
arXiv 2025
-
[3]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023
2023
-
[4]
Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
Pith/arXiv arXiv 2024
-
[5]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024
Pith/arXiv arXiv 2024
-
[6]
Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025
Pith/arXiv arXiv 2025
-
[7]
Helix: A vision-language-action model for generalist humanoid control. https://www.figure.ai/helix, 2025
Figure AI. Helix: A vision-language-action model for generalist humanoid control. https://www.figure.ai/helix, 2025
2025
-
[8]
X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025
Pith/arXiv arXiv 2025
-
[9]
GigaBrain Team, Boyuan Wang, Bohan Li, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099, 2026
arXiv 2026
-
[10]
A pragmatic VLA foundation model.arXiv preprint arXiv:2601.18692, 2026
Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, and Kecheng Zheng. A pragmatic VLA foundation model.arXiv preprint arXiv:2601.18692, 2026
Pith/arXiv arXiv 2026
-
[11]
Galaxea open-world dataset and G0 dual-system VLA model.arXiv preprint arXiv:2509.00576, 2025
Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and G0 dual-system VLA model.arXiv preprint arXiv:2509.00576, 2025
arXiv 2025
-
[12]
Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025
2025
-
[13]
Qwen2.5-vl technical report, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...
Pith/arXiv arXiv 2025
-
[14]
Paligemma: A versatile 3b vlm for transfer
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024
Pith/arXiv arXiv 2024
-
[15]
Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
Pith/arXiv arXiv 2022
-
[16]
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024
Pith/arXiv arXiv 2024
-
[17]
Autoregressive image generation using residual quantization
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022
2022
-
[18]
Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025
Pith/arXiv arXiv 2025
-
[19]
Yuchun Feng, Jinliang Zheng, Zhihao Wang, Dongxiu Liu, Jianxiong Li, Jiangmiao Pang, Tai Wang, and Xianyuan Zhan. Demystifying action space design for robotic manipulation policies. arXiv preprint arXiv:2602.23408, 2026.
Pith/arXiv arXiv 2026
-
[20]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
2024
-
[21]
Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024
Pith/arXiv arXiv 2024
-
[22]
Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation
Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. arXiv preprint arXiv:2412.13877, 2024
Pith/arXiv arXiv 2024
-
[23]
Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653, 2025
arXiv 2025
-
[24]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025
Pith/arXiv arXiv 2025
-
[25]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023
2023
-
[26]
Rt-1: Robotics transformer for real-world control at scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
Pith/arXiv arXiv 2022
-
[27]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
2021
-
[28]
Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, et al. Decisionnce: Embodied multimodal representations via implicit preference learning.arXiv preprint arXiv:2402.18137, 2024
arXiv 2024
-
[29]
Momentum contrast for unsupervised visual representation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020
2020
-
[30]
Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, and Yansong Tang. Clap: Contrastive latent action pretraining for learning vision-language-action models from human videos, 2026. URL https://arxiv.org/abs/2601.04061
arXiv 2026
-
[31]
Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025
Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025
Pith/arXiv arXiv 2025
-
[32]
6d rotation representation for unconstrained head pose estimation
Thorsten Hempel, Ahmed A Abdelrahman, and Ayoub Al-Hamadi. 6d rotation representation for unconstrained head pose estimation. In 2022 IEEE International Conference on image processing (ICIP), pages 2496–2500. IEEE, 2022
2022
-
[33]
Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025
Pith/arXiv arXiv 2025
-
[34]
Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017
Pith/arXiv arXiv 2017
-
[35]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020
Pith/arXiv arXiv 2020
-
[36]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[37]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
2024
-
[38]
Root mean square layer normalization.Advances in neural information processing systems, 32, 2019
Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019
2019
-
[39]
Junming Wang, Teng Pu, Wingmun Fung, Jindong Wang, Shanchang Wang, Yuan Deng, Shuyuan Wang, Ziwei Liu, Kunhao Pan, Ping Yang, et al. Xrzero-g0: Pushing the frontier of dexterous robotic manipulation with interfaces, quality and ratios.arXiv preprint arXiv:2604.13001, 2026
Pith/arXiv arXiv 2026
-
[40]
Shihan Wu, Xuecheng Liu, Shaoxuan Xie, Pengwei Wang, Xinghang Li, et al. RoboCOIN: An open-sourced bimanual robotic data collection for integrated manipulation.arXiv preprint arXiv:2511.17441, 2025
Pith/arXiv arXiv 2025
-
[41]
Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, et al. RoboChallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950, 2025.
arXiv 2025
-
[42]
RealOmin: 10kh RealOmin-open dataset
GenRobot AI. RealOmin: 10kh RealOmin-open dataset. https://huggingface.co/datasets/genrobot2025/10Kh-RealOmin-OpenData, 2025. Open robot manipulation dataset; see also https://www.genrobot.ai/data/open-dataset
2025
-
[43]
Capsfusion: Rethinking image-text data at scale
Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14022–14032, 2024
2024
-
[44]
Cambrian-1: A fully open, vision-centric exploration of multimodal llms
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2024
2024
-
[45]
Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025
2025
-
[46]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014
2014
-
[47]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017
2017
-
[48]
Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025
Pith/arXiv arXiv 2025
-
[49]
Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024
arXiv 2024
-
[50]
Spacethinker.https://huggingface.co/datasets/remyxai/SpaceThinker, 2025
Remyx AI. Spacethinker.https://huggingface.co/datasets/remyxai/SpaceThinker, 2025. Hugging Face dataset page
2025
-
[51]
Openspaces
Remyx AI. Openspaces. https://huggingface.co/datasets/remyxai/OpenSpaces, 2025. Hugging Face dataset page
2025
-
[52]
Remyx AI. Spaceom. https://huggingface.co/datasets/remyxai/SpaceOm, 2025. Hugging Face dataset page
2025
-
[53]
Refspatial.https://huggingface.co/datasets/JingkunAn/RefSpatial, 2025
Jingkun An. Refspatial.https://huggingface.co/datasets/JingkunAn/RefSpatial, 2025. Hugging Face dataset page
2025
-
[54]
Yipu Wang, Yuheng Ji, Yuyang Liu, Enshen Zhou, Ziqiang Yang, Yuxuan Tian, Ziheng Qin, Yue Liu, Huajie Tan, Cheng Chi, et al. Towards cross-view point correspondence in vision-language models. arXiv preprint arXiv:2512.04686, 2025
arXiv 2025
-
[55]
Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, et al. Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511.13719, 2025. URL https://arxiv.org/abs/2511.13719
arXiv 2025
-
[56]
Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R Sanketi, and Ken Goldberg. Robo2vlm: Visual question answering from large-scale in-the-wild robot manipulation datasets.arXiv preprint arXiv:2505.15517, 2025
arXiv 2025
-
[57]
Eo-1: An open unified embodied foundation model for general robot control, 2026
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, and Xuelong Li. Eo-1: An open unified embodied foundation model for general robot control, 2026. URL https://arxiv.org/abs/2508.21112
2026
-
[58]
Robovqa: Multimodal long-horizon reasoning for robotics
Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 645–652. IEEE, 2024
2024
-
[59]
Cosmos-reason1: From physical common sense to embodied reasoning
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025
Pith/arXiv arXiv 2025
-
[60]
RealWorldQA: A benchmark for real-world spatial understanding of multimodal models. https://x.ai/blog/grok-1.5v, 2024
xAI. RealWorldQA: A benchmark for real-world spatial understanding of multimodal models. https://x.ai/blog/grok-1.5v, 2024. Dataset released with Grok-1.5V
2024
-
[61]
Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026
Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026
2026
-
[62]
Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
2023
-
[63]
Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025
2025
-
[64]
Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024
Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024
arXiv 2024
-
[65]
Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: Toward efficient autoregressive vision language action modeling via neural action tokenization.arXiv preprint arXiv:2512.04952, 2025
arXiv 2025
-
[66]
Oat: Ordered action tokenization
Chaoqi Liu, Xiaoshen Han, Jiawei Gao, Yue Zhao, Haonan Chen, and Yilun Du. Oat: Ordered action tokenization. In Proceedings of Robotics: Science and Systems, 2026
2026
-
[67]
Actioncodec: What makes for good action tokenizers.arXiv preprint arXiv:2602.15397, 2026
Zibin Dong, Yicheng Liu, Shiduo Zhang, Baijun Ye, Yifu Yuan, Fei Ni, Jingjing Gong, Xipeng Qiu, Hang Zhao, Yinchuan Li, et al. Actioncodec: What makes for good action tokenizers.arXiv preprint arXiv:2602.15397, 2026
arXiv 2026
-
[68]
Action tokenizer matters in in-context imitation learning
An Dinh Vuong, Minh Nhat Vu, Dong An, and Ian Reid. Action tokenizer matters in in-context imitation learning. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13490–13496. IEEE, 2025
2025
-
[69]
Latent action pretraining from videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. In International Conference on Learning Representations, volume 2025, pages 28213–28239, 2025
2025
-
[70]
Universal actions for enhanced embodied foundation models
Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22508–22519, 2025
2025
-
[71]
Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, and Yixiao Ge. UniT: Toward a unified physical language for human-to-humanoid policy learning and world modeling.arXiv preprint arXiv:2604.19734, 2026
Pith/arXiv arXiv 2026
-
[72]
Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation
Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024
Pith/arXiv arXiv 2024
-
[73]
World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026
Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026
Pith/arXiv arXiv 2026
-
[74]
Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026
Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026
Pith/arXiv arXiv 2026
-
[75]
A generalist agent.arXiv preprint arXiv:2205.06175, 2022
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent.arXiv preprint arXiv:2205.06175, 2022
Pith/arXiv arXiv 2022
-
[76]
Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024
Pith/arXiv arXiv 2024
-
[77]
Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024
Pith/arXiv arXiv 2024
-
[78]
Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023
Pith/arXiv arXiv 2023
-
[79]
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025
Pith/arXiv arXiv 2025
-
[80]
3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024
Pith/arXiv arXiv 2024