pith. machine review for the scientific record.

arxiv: 2409.12514 · v5 · pith:7NLFFWOBnew · submitted 2024-09-19 · 💻 cs.RO · cs.CV

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

Pith reviewed 2026-05-17 16:07 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Vision-Language-Action models · Robotic manipulation · Data-efficient learning · Fast inference · Diffusion policy · Multimodal initialization · No pre-training

The pith

TinyVLA reaches OpenVLA-level performance on robot tasks by initializing from fast multimodal models and adding a diffusion action decoder, removing the pre-training stage entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TinyVLA, a compact family of vision-language-action models for robotic manipulation. It establishes that starting the policy from existing high-speed multimodal vision-language models and attaching a diffusion decoder only during fine-tuning produces faster inference and higher data efficiency than prior VLA systems. These models match or exceed the task success rates of OpenVLA across simulation and real-robot evaluations while generalizing to new language instructions, objects, positions, appearances, backgrounds, and environments. A reader would care because current VLA approaches remain too slow and data-hungry for practical robot deployment.

Core claim

TinyVLA shows that vision-language-action policies can be obtained by directly fine-tuning robust high-speed multimodal backbones with a diffusion policy decoder on task-specific robot data, thereby eliminating the separate large-scale robotic pre-training stage required by earlier models while delivering faster inference and comparable or better manipulation performance.

What carries the argument

Policy backbone initialized from high-speed multimodal vision-language models together with a diffusion policy decoder attached during fine-tuning.
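
To make this pairing concrete, the sketch below runs a standard DDPM-style reverse sampling loop that turns Gaussian noise into an action vector conditioned on a context vector. It is a minimal illustration under stated assumptions, not the paper's implementation: `make_schedule`, `toy_denoiser`, `sample_actions`, the linear noise schedule, and the 50-step count are all hypothetical, and the denoiser is a closed-form stand-in that treats the context as if it already encoded the clean action, where a real decoder would be a trained network conditioned on backbone embeddings.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (illustrative values, not from the paper)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # cumulative product of (1 - beta)
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(noisy, step, context, alpha_bars):
    """Hypothetical stand-in for a learned noise predictor.

    Assumes the context vector IS the clean action, so the noise can be
    recovered in closed form. A trained network would estimate it instead.
    """
    a_bar = alpha_bars[step]
    return [
        (x - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)
        for x, c in zip(noisy, context)
    ]


def sample_actions(context, num_steps=50, seed=0):
    """DDPM reverse process: start from noise, denoise step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in context]
    for t in reversed(range(num_steps)):
        eps = toy_denoiser(x, t, context, alpha_bars)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [
            (xi - scale * ei) / math.sqrt(1.0 - betas[t])
            for xi, ei in zip(x, eps)
        ]
        # No noise is added on the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


context = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 0.25]  # pretend 7-DoF action
actions = sample_actions(context)
```

Because the stand-in denoiser is exact, the loop recovers the encoded action; with a learned denoiser the output would only approximate the demonstrated action distribution.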

If this is right

  • Inference latency drops enough for closed-loop control on standard robot hardware.
  • Effective VLA policies can be trained from far smaller collections of demonstration trajectories.
  • Real-world deployment becomes feasible without access to large-scale robot data centers.
  • Generalization holds across language, object, position, and scene variations at levels matching or exceeding prior models.
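
The first bullet is a timing claim that can be checked with simple arithmetic: a policy keeps up with a control loop when one inference call finishes within the time its predicted actions cover. The helper below is a hypothetical sketch; `control_budget` and its parameters are not from the paper, the numbers in the usage lines are illustrative rather than measured, and executing several actions per inference call is an assumption about deployment.

```python
def control_budget(latency_s, control_hz, actions_per_inference=1):
    """Check whether an inference latency fits a control-loop budget.

    latency_s: time for one policy inference call, in seconds.
    control_hz: rate at which the robot consumes actions.
    actions_per_inference: actions executed per call (assumed, default 1).
    """
    if control_hz <= 0 or actions_per_inference < 1:
        raise ValueError("control_hz must be > 0 and actions_per_inference >= 1")
    period_s = 1.0 / control_hz
    covered_s = actions_per_inference * period_s
    return {
        "period_s": period_s,
        "covered_s": covered_s,
        "keeps_up": latency_s <= covered_s,
    }


# Illustrative numbers only, not measurements from the paper.
fast = control_budget(latency_s=0.05, control_hz=10)
slow = control_budget(latency_s=0.30, control_hz=10)
slow_chunked = control_budget(latency_s=0.30, control_hz=10, actions_per_inference=4)
```

The third call shows why executing several predicted actions per call relaxes the budget: the same latency that fails a one-action loop can fit once each call covers more control periods.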

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same initialization-plus-diffusion recipe may transfer to other continuous control domains such as navigation or mobile manipulation.
  • Further compression of the backbone could yield models suitable for on-device inference on low-power robots.
  • The diffusion decoder's ability to model multimodal action distributions might reduce failure modes in contact-rich tasks.

Load-bearing premise

High-speed multimodal models already encode enough general visuomotor knowledge that fine-tuning alone, without any robot-specific pre-training, can reach or surpass the performance of models trained on massive robotic datasets.

What would settle it

A controlled experiment in which a model with the same multimodal initialization and diffusion decoder is trained on the same limited dataset as OpenVLA but still requires an additional pre-training stage to match its success rate.
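
Such an experiment ends in two sets of success counts, one per training arm, and the question is whether the gap between them exceeds what chance would produce. One way to check this is a one-sided Fisher exact test, sketched below with the standard library only. The function name `fisher_one_sided` and the trial counts in the usage lines are hypothetical; the paper does not report this analysis.

```python
from math import comb


def fisher_one_sided(succ_a, fail_a, succ_b, fail_b):
    """One-sided Fisher exact p-value for a 2x2 success/failure table.

    Returns the probability, under the null hypothesis that both arms share
    one success rate, of arm A having succ_a or more successes given the
    observed row and column totals (hypergeometric tail).
    """
    n_a = succ_a + fail_a
    total_succ = succ_a + succ_b
    total_fail = fail_a + fail_b
    total = total_succ + total_fail
    denom = comb(total, n_a)
    upper = min(n_a, total_succ)
    tail = sum(
        comb(total_succ, k) * comb(total_fail, n_a - k)
        for k in range(succ_a, upper + 1)
    )
    return tail / denom


# Hypothetical counts: arm A = with extra pre-training, arm B = without.
p_clear_gap = fisher_one_sided(succ_a=10, fail_a=0, succ_b=0, fail_b=10)
p_no_gap = fisher_one_sided(succ_a=5, fail_a=5, succ_b=5, fail_b=5)
```

A small p-value for arm A would support the idea that pre-training is still needed; a large one means the observed counts do not distinguish the two arms at that number of trials.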

Original abstract

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this paper, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for a pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in both simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning. Our project is at https://tiny-vla.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TinyVLA, a family of compact vision-language-action models for robotic manipulation. It proposes initializing the policy backbone with pre-trained high-speed multimodal models and integrating a diffusion policy decoder during fine-tuning to eliminate the robotic pre-training stage. The approach is evaluated in both simulation and real-robot settings, claiming faster inference, improved data efficiency, and comparable or superior performance to OpenVLA, along with strong generalization across language instructions, novel objects, unseen positions, object appearances, backgrounds, and environments.

Significance. If the results hold under rigorous verification, this work would be significant for enabling practical deployment of VLA models in real-world robotics by addressing inference speed and data requirements. The strategy of transferring features from existing multimodal models to visuomotor policies offers a promising direction for data-efficient robot learning and could reduce dependence on large-scale robotic datasets.

major comments (2)
  1. [Experiments / Ablation Studies] The central claim that TinyVLA eliminates pre-training while matching or exceeding OpenVLA performance relies on the transferability of features from high-speed multimodal models plus the diffusion decoder. However, no ablation study compares against a randomly initialized or from-scratch policy backbone trained on identical fine-tuning data and splits. Without this control, it is impossible to isolate whether the reported data efficiency and generalization gains are due to the proposed architecture or simply the pre-trained weights (see weakest assumption in reader's report).
  2. [§5, Real-Robot Experiments] Table 1 and real-robot results: success rates and inference speeds are reported as outperforming OpenVLA, but the manuscript provides no error bars, number of evaluation trials, or details on training data volume and splits used for the data-efficiency comparisons. This weakens confidence in the quantitative claims of outperformance and generalization.
minor comments (2)
  1. [Abstract] The abstract states 'significantly outperforms' without any numerical values for speed or success rates; including one or two key metrics would strengthen the summary.
  2. [Figures 4-6] Figure captions for qualitative generalization examples should explicitly state the number of trials and success criteria to allow readers to assess the visual results.
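
Major comment 2 asks for error bars and trial counts on success rates. One common way to report both is a Wilson score interval, which behaves sensibly at the small trial counts typical of real-robot evaluation. The sketch below is an illustration of that standard formula; `wilson_interval` and the example counts are hypothetical and are not taken from the manuscript.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z=1.96 corresponds to an approximately 95% interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials**2))
        / denom
    )
    return center - half_width, center + half_width


# Hypothetical example: 8 successes in 10 evaluation trials.
low, high = wilson_interval(8, 10)
```

With 8 successes in 10 trials the interval runs from roughly 0.49 to 0.94, which shows how little a single 80% figure constrains the true success rate at that sample size.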

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We have carefully considered each comment and provide detailed responses below, along with plans for revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [Experiments / Ablation Studies] The central claim that TinyVLA eliminates pre-training while matching or exceeding OpenVLA performance relies on the transferability of features from high-speed multimodal models plus the diffusion decoder. However, no ablation study compares against a randomly initialized or from-scratch policy backbone trained on identical fine-tuning data and splits. Without this control, it is impossible to isolate whether the reported data efficiency and generalization gains are due to the proposed architecture or simply the pre-trained weights (see weakest assumption in reader's report).

    Authors: We agree that an ablation comparing against a randomly initialized backbone trained on the same fine-tuning data would help isolate the contribution of the pre-trained multimodal weights. Our primary comparisons are against OpenVLA, which requires large-scale robotic pre-training, to highlight the benefit of bypassing that stage via high-speed multimodal initialization. To strengthen the evidence, we will add this ablation study to the revised manuscript.
    Revision: yes

  2. Referee: [§5, Real-Robot Experiments] Table 1 and real-robot results: success rates and inference speeds are reported as outperforming OpenVLA, but the manuscript provides no error bars, number of evaluation trials, or details on training data volume and splits used for the data-efficiency comparisons. This weakens confidence in the quantitative claims of outperformance and generalization.

    Authors: We appreciate this feedback on experimental reporting. In the revised manuscript, we will include error bars for success rates, report the number of evaluation trials for each experiment, and provide additional details on training data volumes and splits used in the data-efficiency comparisons to improve transparency and confidence in the results.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents TinyVLA as an empirical architecture that initializes a policy backbone from existing high-speed multimodal models and adds a diffusion decoder at fine-tuning time, then reports comparative results against OpenVLA on speed, data efficiency, and task success. No equations, fitted parameters, or self-citations are shown to reduce the central performance claims to inputs by construction. The reported advantages are measured outcomes from simulation and real-robot experiments rather than self-definitional re-labeling or load-bearing uniqueness theorems imported from the same authors' prior work. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the empirical success of two design choices whose effectiveness is demonstrated only through the reported experiments; no explicit free parameters or invented entities are named in the abstract.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  2. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

  3. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  4. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  5. FLARE: Robot Learning with Implicit World Modeling

    cs.RO 2025-05 unverdicted novelty 6.0

    FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.

  6. GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    cs.RO 2025-05 unverdicted novelty 6.0

    GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

  7. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  8. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  9. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  10. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  11. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  12. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    cs.RO 2025-02 unverdicted novelty 6.0

    A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.

  13. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  14. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  15. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  16. π₀: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  17. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  18. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
