Recognition: unknown
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
Pith reviewed 2026-05-08 11:27 UTC · model grok-4.3
The pith
Future advances in vision-language-action robotics will depend more on high-fidelity data engines and evaluation protocols than on model architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The survey organizes VLA research around datasets categorized by embodiment diversity, modality composition, and action space; benchmarks evaluated jointly on task complexity and environment structure; and data engines spanning simulation, video reconstruction, and automated task generation. It finds that all three data-engine paradigms share limitations in physical grounding and sim-to-real transfer, and that current benchmarks leave gaps in compositional generalization and long-horizon reasoning. The authors conclude that four challenges (representation alignment, multimodal supervision, reasoning assessment, and scalable data generation) must be treated as primary research targets, with data infrastructure elevated from a background concern to a first-class research problem.
What carries the argument
The three-pillar data-centric framework that classifies datasets along embodiment, modalities, and actions; benchmarks along task complexity and environment structure; and data engines along simulation, video-reconstruction, and automated generation methods.
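To make the classification concrete, here is a minimal sketch of the three-pillar taxonomy as plain data structures. The class names, enum members, and fields are illustrative choices under stated assumptions, not terminology taken from the survey itself.

```python
# Sketch of the survey's three-pillar taxonomy as plain data structures.
# All names below are illustrative, not taken from the paper.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Embodiment(Enum):
    SINGLE_ARM = auto()
    BIMANUAL = auto()
    MOBILE_MANIPULATOR = auto()
    HUMANOID = auto()


class Modality(Enum):
    RGB = auto()
    DEPTH = auto()
    PROPRIOCEPTION = auto()
    LANGUAGE = auto()
    TACTILE = auto()


class ActionSpace(Enum):
    JOINT_POSITION = auto()
    END_EFFECTOR_POSE = auto()
    DISCRETE_ACTION_TOKENS = auto()


@dataclass
class DatasetEntry:
    """Pillar 1: a corpus described by embodiment, modalities, and action space."""
    name: str
    embodiments: List[Embodiment]
    modalities: List[Modality]
    action_space: ActionSpace
    real_world: bool  # real vs. synthetic; the axis behind the fidelity-cost trade-off


@dataclass
class BenchmarkEntry:
    """Pillar 2: an evaluation suite placed jointly on two axes."""
    name: str
    task_complexity: int          # e.g. number of chained subgoals per episode
    structured_environment: bool  # curated scenes vs. open, unstructured settings


@dataclass
class DataEngineEntry:
    """Pillar 3: a generation pipeline, grouped by paradigm."""
    name: str
    paradigm: str  # "simulation" | "video-reconstruction" | "automated-task-generation"
```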
If this is right
- Improved representation alignment across vision, language, and action data will become necessary for effective training.
- New forms of multimodal supervision will be required to handle mixed inputs reliably.
- Evaluation protocols must expand to test reasoning over extended sequences rather than isolated steps (see the sketch after this list).
- Scalable data generation techniques will need to solve physical grounding to enable larger training sets.
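As a concrete illustration of the evaluation gap named above, the following is a minimal sketch of a long-horizon evaluation loop that only credits an episode when every subtask in a chained instruction sequence succeeds. The `env`, `policy.act`, and success-predicate interfaces are hypothetical, assumed for illustration rather than drawn from any existing benchmark API.

```python
# Sketch of a long-horizon evaluation loop: credit is only given when every
# subtask in a chained instruction sequence succeeds, which exposes compounding
# errors that per-step scoring hides. `env`, `policy.act`, and the success
# predicates are hypothetical interfaces assumed for illustration.
from typing import Callable, Sequence


def evaluate_long_horizon(
    env,
    policy,
    instructions: Sequence[str],
    subtask_done: Sequence[Callable[[object], bool]],
    max_steps_per_subtask: int = 200,
) -> dict:
    obs = env.reset()
    completed = 0
    for instruction, done in zip(instructions, subtask_done):
        success = False
        for _ in range(max_steps_per_subtask):
            action = policy.act(obs, instruction)
            obs = env.step(action)
            if done(obs):
                success = True
                break
        if not success:
            break  # one failed link breaks the whole chain
        completed += 1

    n = len(instructions)
    return {
        "subtasks_completed": completed,
        "subtask_rate": completed / n,         # the lenient, isolated-step view
        "full_chain_success": completed == n,  # the long-horizon view
    }
```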
Where Pith is reading between the lines
- Labs may shift resources from model training clusters toward shared real-world data collection platforms.
- The same data-first lens could be applied to related areas such as manipulation in unstructured homes.
- A direct test could measure whether performance gains from a new data engine exceed those from doubling model size on fixed data.
Load-bearing premise
The survey's division of the field into datasets, benchmarks, and data engines fully identifies the central bottlenecks, with compositional generalization and long-horizon reasoning as the most important missing capabilities.
What would settle it
A controlled comparison where scaling an existing VLA model on current datasets and benchmarks produces strong compositional generalization and long-horizon success without any new data engine or evaluation protocol would contradict the central claim.
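A minimal sketch of that controlled comparison follows, assuming hypothetical `train` and `evaluate_compositional` callables supplied by the experimenter; the condition names and budget numbers are illustrative only.

```python
# Sketch of the falsification test: hold the benchmark fixed and compare
# (a) doubling model capacity on existing data against (b) keeping the model
# fixed and adding data from a new engine. `train` and `evaluate_compositional`
# are hypothetical callables supplied by the experimenter.
from dataclasses import dataclass


@dataclass
class Condition:
    name: str
    model_params: int     # parameter count
    dataset_hours: float  # hours of demonstrations
    uses_new_engine: bool


def run_comparison(train, evaluate_compositional,
                   base_params: int = 1_000_000_000,
                   base_hours: float = 5_000.0) -> dict:
    conditions = [
        Condition("baseline", base_params, base_hours, uses_new_engine=False),
        Condition("2x-model", 2 * base_params, base_hours, uses_new_engine=False),
        Condition("new-engine", base_params, 2 * base_hours, uses_new_engine=True),
    ]
    results = {}
    for cond in conditions:
        policy = train(cond)  # experimenter-supplied training loop
        results[cond.name] = evaluate_compositional(policy)  # held-out compositions, long chains
    # If "2x-model" alone closes the compositional and long-horizon gaps, the
    # survey's data-first thesis is undermined; if "new-engine" is required, it holds.
    return results
```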
Original abstract
Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon reasoning evaluation that existing protocols fail to address. For data engines, we examine simulation-based, video-reconstruction, and automated task-generation paradigms, identifying their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these analyses, we distill four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation. Addressing them, we argue, requires treating data infrastructure as a first-class research problem rather than a background concern.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey of Vision-Language-Action (VLA) models in robotics that organizes the literature around three pillars—datasets, benchmarks, and data engines—and argues that future progress will depend more on the co-design of high-fidelity data engines and structured evaluation protocols than on model architecture. Datasets are categorized by embodiment diversity, modality composition, and action space, revealing a fidelity-cost trade-off. Benchmarks are analyzed jointly by task complexity and environment structure, exposing gaps in compositional generalization and long-horizon reasoning. Data engines (simulation-based, video-reconstruction, automated task generation) are critiqued for limitations in physical grounding and sim-to-real transfer. The survey distills four open challenges—representation alignment, multimodal supervision, reasoning assessment, and scalable data generation—and concludes that data infrastructure must be treated as a first-class research problem.
Significance. If the categorization and gap analysis are representative, the survey offers a timely data-centric reframing of VLA research that could usefully redirect community attention from architecture scaling to infrastructure co-design. The three-pillar structure provides a clear organizing framework, and the distillation of four concrete open challenges supplies actionable guidance. The internal consistency of the argument (synthesis from the surveyed literature rather than new empirical claims) is a strength, as is the explicit identification of the fidelity-cost trade-off and evaluation-protocol gaps.
minor comments (3)
- §2 (or equivalent methodology section): the claim of a 'systematic' analysis would be strengthened by an explicit statement of the literature search strategy, inclusion/exclusion criteria, and date range; without this, readers cannot fully assess coverage completeness or selection bias.
- §§3–5: several gap claims (e.g., the absence of compositional generalization tests) would benefit from one or two concrete counter-examples or table entries showing which specific benchmarks or datasets were examined and found lacking, rather than remaining at the level of qualitative synthesis.
- Figure 1 (or equivalent taxonomy figure): the visual categorization of datasets, benchmarks, and data engines is helpful but would be clearer if each leaf node included a representative citation count or example paper to ground the taxonomy in the literature.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of the manuscript, recognition of its data-centric framing, and recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity
full rationale
This survey paper performs a qualitative synthesis of external literature on Vision-Language-Action models, organizing the field into datasets, benchmarks, and data engines without any mathematical derivations, equations, fitted parameters, or self-referential definitions. The central claim that future advances depend on co-design of data engines and evaluation protocols follows directly from the reported categorization of fidelity-cost trade-offs, structural gaps, and physical-grounding limitations drawn from cited external works; no step reduces by construction to the paper's own inputs or unverified self-citations. The analysis is self-contained against external benchmarks.