pith. machine review for the scientific record.

arxiv: 2412.14058 · v4 · pith:E3T33REDnew · submitted 2024-12-18 · 💻 cs.RO · cs.CV

What Matters in Building Vision-Language-Action Models for Generalist Robots

Pith reviewed 2026-05-17 21:33 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords vision-language-action models · robot manipulation · VLA · cross-embodiment data · generalist robots · foundation models · policy architectures

The pith

Specific choices in backbones, architectures, and data timing let simple Vision-Language-Action models set new robot manipulation records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates which design decisions most shape the success of Vision-Language-Action models on robotic tasks. Through more than 600 controlled experiments spanning eight different vision-language backbones and four policy structures, the authors map out reliable patterns for backbone selection, model layout, and the inclusion of data from multiple robot bodies. These patterns produce a lightweight framework called RoboVLMs that needs little hand-tuning yet outperforms earlier methods on both simulated benchmarks and physical robot trials. A reader would care because the results supply a concrete, experiment-backed set of rules for building capable general-purpose robots instead of relying on intuition or repeated trial and error.

Core claim

The authors show that performance of Vision-Language-Action models on manipulation problems is strongly determined by three factors: the choice of underlying vision-language backbone, the way action outputs are integrated into the architecture, and the stage at which cross-embodiment data are introduced during training. Systematic comparison across hundreds of runs identifies combinations that minimize manual engineering while raising success rates. The resulting RoboVLMs family, built from these preferred choices, reaches new state-of-the-art scores on three simulation suites and on real-world experiments.

What carries the argument

The RoboVLMs framework, a flexible structure that lets researchers plug in different vision-language models and freely combine the tested design options for producing action outputs.
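
One way to picture the "plug in a backbone, freely combine design options" idea is as a Cartesian product over the three design axes the paper studies. The sketch below is only an illustration under stated assumptions: the `DesignChoice` class, the `enumerate_design_space` helper, and every backbone, architecture, and stage name are hypothetical placeholders, not identifiers from the RoboVLMs codebase. The axis sizes of 8 backbones and 4 policy architectures follow the abstract; the three stage labels are invented for the example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DesignChoice:
    """One point in a hypothetical VLA design space."""
    backbone: str
    policy_architecture: str
    cross_embodiment_stage: str


def enumerate_design_space(backbones, architectures, stages):
    """Return every combination of the three design axes."""
    return [
        DesignChoice(b, a, s)
        for b, a, s in product(backbones, architectures, stages)
    ]


# Placeholder names only; the real backbones and architectures are
# described in the paper, not here.
backbones = [f"vlm_backbone_{i}" for i in range(8)]
architectures = [f"policy_arch_{i}" for i in range(4)]
# Hypothetical labels for when cross-embodiment data is introduced.
stages = ["none", "pre_training", "co_training"]

grid = enumerate_design_space(backbones, architectures, stages)
print(len(grid))  # 8 * 4 * 3 = 96 combinations
```

The grid size here is a property of the placeholder axes and is not meant to reproduce the paper's count of over 600 experiments, which is not broken down in the text reviewed here.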

If this is right

  • Future Vision-Language-Action models should prioritize the backbones and architectural layouts that ranked highest in the controlled tests.
  • Cross-embodiment data should be added at the training stage identified as most effective rather than at arbitrary points.
  • Minimal-manual-design models built this way can exceed the performance of more heavily engineered alternatives on standard manipulation benchmarks.
  • Open release of the framework and all training recipes allows direct reuse of the winning combinations on new problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same systematic testing approach could shorten development cycles for robot skills beyond the manipulation tasks examined here.
  • Results on a broader set of robot embodiments would test whether the observed design rankings remain stable when the hardware changes substantially.
  • The guidebook of choices might serve as a starting point for adapting similar models to non-manipulation domains such as navigation or assembly.

Load-bearing premise

The chosen simulation tasks and real robot experiments are representative enough that the ranking of design choices will hold for new tasks, robot bodies, and settings not tested here.

What would settle it

Apply the top-ranked backbone, architecture, and data-timing choices from the study to a previously unseen manipulation task or robot embodiment and check whether the resulting model still outperforms the alternatives that ranked lower in the original experiments.
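
The check described above amounts to asking whether the ranking of configurations is preserved on the new task. A minimal sketch of one way to quantify that, assuming per-configuration success rates are available for both settings, is a Kendall rank correlation (the tau-a variant, chosen here for illustration and not taken from the paper). The function name, configuration names, and all numbers are hypothetical and made up for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two {config: score} dicts over shared configs.

    Returns 1.0 when both settings rank the configs identically and
    -1.0 when the ranking is fully reversed. Tied pairs count as neither
    concordant nor discordant.
    """
    keys = sorted(set(scores_a) & set(scores_b))
    if len(keys) < 2:
        raise ValueError("need at least two shared configurations")
    concordant = discordant = 0
    for x, y in combinations(keys, 2):
        direction = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(keys) * (len(keys) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up success rates for three hypothetical configurations.
# These are NOT results from the paper.
original_benchmark = {"config_a": 0.9, "config_b": 0.7, "config_c": 0.5}
held_out_task = {"config_a": 0.6, "config_b": 0.65, "config_c": 0.3}

tau = kendall_tau(original_benchmark, held_out_task)
print(f"rank agreement (Kendall tau-a): {tau:.3f}")
```

A value near 1 would suggest the design ranking transfers; a value near 0 or below would suggest the original ranking was specific to the tested tasks. What threshold counts as "settled" is a judgment call that this sketch does not decide.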

Original abstract

To utilize Foundation Vision Language Models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs and building the Vision-Language-Action models (VLAs). In this work, we disclose the key factors that significantly influence the performance of VLA on robot manipulation problems and focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The obtained results convince us firmly to explain why we prefer VLA and develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve a new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integrations of new VLMs and free combinations of various design choices, is made public to facilitate future research. We open-source all details, including codes, models, datasets, and toolkits, along with detailed training and evaluation recipes at: robovlms.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically investigates key design choices for Vision-Language-Action (VLA) models in robotic manipulation: VLM backbone selection, policy architecture formulation, and timing for incorporating cross-embodiment data. Through more than 600 experiments spanning 8 VLM backbones and 4 policy architectures, the authors derive a guidebook of design recommendations, introduce the flexible RoboVLMs framework (which requires minimal manual design), and report new state-of-the-art results on three simulation tasks plus real-world experiments. All code, models, datasets, and training recipes are open-sourced.

Significance. If the empirical rankings prove robust, the work supplies a practical, large-scale reference for VLA construction that could accelerate development of generalist robot policies. The open-sourcing of a modular framework supporting easy VLM integration and free combination of design choices is a concrete strength that lowers barriers for follow-on research.

major comments (2)
  1. Abstract and Results sections: the SOTA performance claims and the reliability of the resulting design guidebook rest on comparisons whose statistical significance, error bars, and exact train/validation/test splits are not reported. Without these, it is impossible to determine whether observed deltas reflect genuine factor importance or post-hoc selection and baseline tuning effects.
  2. Evaluation and Discussion sections: the central claim that the observed rankings of backbones, architectures, and cross-embodiment timing constitute a transferable guidebook assumes the chosen simulation tasks and real-world setups are representative of broader generalist manipulation. No additional held-out tasks, embodiments, or distribution-shift experiments are described to test whether the identified preferences are artifacts of the narrow task distribution (short-horizon pick-and-place, limited object diversity).
minor comments (2)
  1. The abstract states 'over 600 distinct designed experiments' but does not clarify how many runs per configuration or whether hyperparameter sweeps were performed uniformly across all backbones.
  2. Notation for the four policy architectures and the exact integration points for action tokens would benefit from an explicit diagram or table early in the methods section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We have carefully reviewed the major comments and provide point-by-point responses below. We agree that additional reporting and discussion will strengthen the manuscript and propose specific revisions to address these points.

Point-by-point responses
  1. Referee: Abstract and Results sections: the SOTA performance claims and the reliability of the resulting design guidebook rest on comparisons whose statistical significance, error bars, and exact train/validation/test splits are not reported. Without these, it is impossible to determine whether observed deltas reflect genuine factor importance or post-hoc selection and baseline tuning effects.

    Authors: We agree that statistical rigor is important for validating the observed performance differences and the resulting design recommendations. In the revised manuscript, we will add error bars computed from multiple independent training runs for the key comparative experiments, explicitly report the train/validation/test splits used across all tasks, and include statistical significance tests (such as paired t-tests) for the main deltas. These additions will be incorporated into the Results section and referenced in the Abstract where appropriate, allowing readers to better assess the robustness of the findings against potential selection effects.

    revision: yes

  2. Referee: Evaluation and Discussion sections: the central claim that the observed rankings of backbones, architectures, and cross-embodiment timing constitute a transferable guidebook assumes the chosen simulation tasks and real-world setups are representative of broader generalist manipulation. No additional held-out tasks, embodiments, or distribution-shift experiments are described to test whether the identified preferences are artifacts of the narrow task distribution (short-horizon pick-and-place, limited object diversity).

    Authors: We acknowledge that the current evaluation focuses on established simulation benchmarks and real-world setups with variations in objects and embodiments, which already incorporate some cross-embodiment transfer testing. To address concerns about broader generalizability, we will revise the Discussion section to explicitly analyze the scope of the task distribution, discuss potential limitations regarding transfer to longer-horizon or more diverse tasks, and outline future directions for held-out evaluations. While we cannot introduce entirely new large-scale experiments on additional distribution shifts within the revision timeline, the expanded discussion will clarify the assumptions and boundaries of the guidebook based on the existing 600+ experiments.

    revision: partial
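
The statistical additions promised in the first response (error bars over independent runs and a paired t-test on the main deltas) can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' analysis code: the helper names are hypothetical, the pairing assumes both configurations were trained with matched seeds, and the success rates are made-up numbers.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the sample mean and its standard error (for error bars)."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def paired_t_statistic(runs_a, runs_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. matched seeds."""
    if len(runs_a) != len(runs_b) or len(runs_a) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [a - b for a, b in zip(runs_a, runs_b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up success rates (percent) for two hypothetical configurations,
# one entry per training seed. These are NOT results from the paper.
config_a = [80, 82, 78, 84]
config_b = [75, 79, 74, 80]

mean_a, sem_a = mean_and_sem(config_a)
mean_b, sem_b = mean_and_sem(config_b)
t_stat, dof = paired_t_statistic(config_a, config_b)

print(f"A: {mean_a:.1f} +/- {sem_a:.2f}")
print(f"B: {mean_b:.1f} +/- {sem_b:.2f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_rel`) computes both in one call. With only a handful of seeds per configuration, the test has little power, so the error bars are often the more informative part of the report.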

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study

Full rationale

The paper reports results from over 600 ablation experiments across 8 VLM backbones and 4 policy architectures to identify design preferences for VLAs and to introduce RoboVLMs. No mathematical derivation chain, first-principles equations, or predictions are claimed; performance rankings and SOTA results are presented as direct outcomes of the controlled experiments on the chosen simulation and real-world tasks. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The work is self-contained as an empirical guidebook whose validity rests on the reproducibility of the reported runs rather than any reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard supervised learning assumptions plus the premise that the selected robot tasks and metrics capture generalist performance. No new physical laws or mathematical axioms are introduced.

free parameters (1)
  • Training hyperparameters across 600+ runs
    Learning rates, batch sizes, and architecture-specific scaling factors are fitted or chosen per backbone and policy variant.
axioms (1)
  • domain assumption: Standard supervised imitation learning on demonstration data produces policies that generalize to unseen instructions and scenes.
    Invoked when claiming real-world and simulation success from training on collected trajectories.

pith-pipeline@v0.9.0 · 5570 in / 1355 out tokens · 38312 ms · 2026-05-17T21:33:48.446845+00:00

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing · dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: "we focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  4. PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

  5. Bimanual Robot Manipulation via Multi-Agent In-Context Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    BiCICLe frames bimanual robot control as a multi-agent leader-follower problem with Arms' Debate and an LLM judge, achieving up to 71.1% success on 13 TWIN benchmark tasks without fine-tuning.

  6. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  7. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  8. HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

    cs.RO 2025-12 unverdicted novelty 6.0

    HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.

  9. AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    cs.RO 2025-11 unverdicted novelty 6.0

    AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

  10. villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    cs.RO 2025-07 unverdicted novelty 6.0

    villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-wor...

  11. DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    cs.CV 2025-07 unverdicted novelty 6.0

    DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

  12. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    cs.RO 2025-05 unverdicted novelty 6.0

    UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

  13. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  14. Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning

    cs.RO 2026-05 unverdicted novelty 5.0

    Instruction drift is a sampling error in VLA models that CAPS mitigates via power distributions and SNR-based metacognitive MCMC switching, yielding better long-horizon results without retraining.

  15. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  16. GR-3 Technical Report

    cs.RO 2025-07 unverdicted novelty 5.0

    GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

  17. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  18. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
