VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3
The pith
VLA Foundry supplies a single open codebase for training vision-language-action models from language pretraining through to action fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA Foundry is an open-source framework that provides a shared training stack with end-to-end control over vision-language-action models, from language pretraining to action-expert fine-tuning. It supports both fully from-scratch training through an LLM → VLM → VLA pipeline and fine-tuning of pretrained backbones from Hugging Face. Two models trained with the framework are evaluated in closed loop on the LBM Eval simulator: a fully open from-scratch model that reaches performance on par with prior closed-source work in the nominal setting, and a model built on the Qwen3-VL backbone that yields a strong multi-task tabletop manipulation policy, outperforming the baseline by a wide margin.
What carries the argument
The shared training stack that handles end-to-end control from language pretraining to action-expert fine-tuning.
If this is right
- Fully open from-scratch models can reach performance levels comparable to previous closed-source work on tabletop manipulation tasks.
- Substituting a strong pretrained vision-language backbone produces multi-task policies that substantially outperform a baseline on the same simulator.
- Releasing the full codebase and model weights allows other researchers to train and evaluate their own VLA models without building separate pipelines for each stage.
- Usability improvements to the simulator and analysis tools make it simpler for the community to reproduce and extend the evaluations.
Where Pith is reading between the lines
- Other groups could adopt the same unified stack to test new backbones or tasks without rebuilding the entire training pipeline from scratch each time.
- Running the released models on physical robot hardware would provide a direct check on whether the simulator results translate to real environments.
- The framework's support for swapping components might enable faster iteration when newer vision-language models become available.
Load-bearing premise
Performance gains measured in the LBM Eval simulator reflect meaningful improvements in real-world robot capabilities, and the unified framework structure itself, rather than other implementation details, drives the reported results.
What would settle it
A side-by-side evaluation on the same LBM Eval tasks showing that the from-scratch model does not match prior closed-source success rates or that the Qwen3-VL model does not outperform the baseline by the reported margin.
Original abstract
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
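The two training modes the abstract describes (a full LLM → VLM → VLA run versus action fine-tuning on a pretrained backbone) can be sketched as staged configurations that each initialize from the previous stage's checkpoint. This is an illustrative sketch only: `StageConfig`, `build_pipeline`, the dataset labels, and the checkpoint ids are assumptions for exposition, not the actual VLA Foundry API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a staged LLM -> VLM -> VLA pipeline.
# All names here are illustrative, not the real VLA Foundry interface.

@dataclass
class StageConfig:
    name: str                              # "llm", "vlm", or "vla"
    init_from: str                         # "scratch" or a checkpoint/backbone id
    datasets: list[str] = field(default_factory=list)

def build_pipeline(backbone: str = "scratch") -> list[StageConfig]:
    """Chain the stages so each one initializes from the previous checkpoint.

    Passing a pretrained VLM id (e.g. a Qwen3-VL checkpoint) lets the
    action-expert stage skip the earlier stages, mirroring the paper's
    two training modes.
    """
    if backbone != "scratch":
        # Backbone mode: only the action-expert stage is trained.
        return [StageConfig("vla", backbone, ["robot demonstrations"])]
    stages: list[StageConfig] = []
    prev = "scratch"
    for name, data in [
        ("llm", ["web text"]),
        ("vlm", ["image-text pairs"]),
        ("vla", ["robot demonstrations"]),
    ]:
        stages.append(StageConfig(name, prev, data))
        prev = f"ckpt/{name}"  # the next stage fine-tunes this stage's output
    return stages

scratch = build_pipeline()              # full LLM -> VLM -> VLA run
pretrained = build_pipeline("qwen3-vl") # action fine-tuning only
```

The point of the sketch is the chaining: in the from-scratch run each stage consumes the checkpoint the previous stage produced, whereas a pretrained backbone collapses the pipeline to a single action-training stage.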
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training within a single codebase supporting end-to-end control from language pretraining to action fine-tuning. It enables both fully from-scratch training via an LLM-to-VLM-to-VLA pipeline and integration of pretrained backbones such as Qwen3-VL. The authors train and release two models, evaluate closed-loop performance on the LBM Eval simulator for multi-task tabletop manipulation, contribute simulator usability improvements and STEP analysis tools, and claim that the from-scratch model matches prior closed-source work while the Qwen3-VL variant substantially outperforms the baseline. Code and weights are publicly released.
Significance. If the performance attribution holds, the work provides a reusable open engineering artifact that could reduce fragmentation in VLA development by replacing stitched pipelines with a shared stack. Explicit strengths include the public release of the full codebase, model weights, and simulator enhancements, which directly support reproducibility and community extension in robotics research.
Major comments (2)
- [Abstract and Experiments] The positioning of the shared training stack and end-to-end control as the key innovation is load-bearing, yet no ablation studies are presented that hold data volume, compute budget, and backbone fixed while varying only the unification aspect of the pipeline. This makes it impossible to credit the reported gains (from-scratch parity and Qwen3-VL outperformance) specifically to VLA Foundry rather than to implementation details or backbone strength.
- [Evaluation] All quantitative claims rest on closed-loop policy performance within the LBM Eval simulator alone; no real-robot transfer experiments or physical deployment results are included, so the practical significance for tabletop manipulation remains untested even if the simulator numbers are accurate.
Minor comments (2)
- The phrase 'nominal evaluation setting' appears in the abstract without a clear definition or pointer to the corresponding experimental protocol in the main text.
- Training hyperparameter tables or configuration files could be expanded to list exact data mixtures and optimizer settings used for the from-scratch versus backbone-substituted runs.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We are pleased that the open-source nature and potential for reducing fragmentation in VLA development are recognized as strengths. Below, we provide point-by-point responses to the major comments, indicating planned revisions where appropriate.
Point-by-point responses
- Referee [Abstract and Experiments]: The positioning of the shared training stack and end-to-end control as the key innovation is load-bearing, yet no ablation studies are presented that hold data volume, compute budget, and backbone fixed while varying only the unification aspect of the pipeline. This makes it impossible to credit the reported gains (from-scratch parity and Qwen3-VL outperformance) specifically to VLA Foundry rather than to implementation details or backbone strength.
Authors: We agree that an ablation isolating the unification aspect, while controlling for data, compute, and backbone, would more definitively attribute performance improvements to the shared stack. Our experiments instead demonstrate the framework's end-to-end capabilities by training competitive models from scratch and with pretrained backbones, achieving parity with prior closed-source results and outperforming baselines in the Qwen3-VL case. These results highlight the practical utility of VLA Foundry for unified training. To address the concern, we will revise the abstract and experiments section to scope the claims to the framework's enabling role, and we will add a limitations paragraph discussing the lack of such fine-grained ablations and our intent to pursue them in follow-up work. Revision: partial.
- Referee [Evaluation]: All quantitative claims rest on closed-loop policy performance within the LBM Eval simulator alone; no real-robot transfer experiments or physical deployment results are included, so the practical significance for tabletop manipulation remains untested even if the simulator numbers are accurate.
Authors: We concur that real-robot validation is essential for establishing practical significance beyond simulation. The manuscript's evaluations are deliberately scoped to the LBM Eval simulator to enable reproducible, large-scale multi-task assessment in a controlled environment, consistent with many prior VLA works. We will update the evaluation and conclusion sections to explicitly acknowledge this limitation and to highlight that the released codebase and models are designed to facilitate future real-world deployment and transfer studies. Revision: yes.
Circularity Check
No circularity: an engineering framework evaluated only against external benchmarks.
Full rationale
The paper introduces an open-source codebase for unified LLM/VLM/VLA training and reports empirical results on the external LBM Eval simulator. No mathematical derivations, fitted parameters renamed as predictions, self-definitional claims, or load-bearing self-citation chains appear in the provided text. Performance statements (on-par with prior closed work, Qwen3-VL variant outperforming baseline) are direct comparisons to external references rather than reductions to the paper's own inputs. The contribution is an artifact release plus simulator runs; the derivation chain is empty.