VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
Pith reviewed 2026-05-10 02:05 UTC · model grok-4.3
The pith
VLA Foundry supplies a single open codebase for training vision-language-action models from language pretraining through to action fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA Foundry is an open-source framework that provides a shared training stack with end-to-end control over vision-language-action models, from language pretraining to action-expert fine-tuning. It supports both fully from-scratch training through an LLM → VLM → VLA pipeline and fine-tuning of pretrained backbones from Hugging Face. Two models trained with the framework are evaluated in closed loop on the LBM Eval simulator: a fully open from-scratch model that reaches performance on par with prior closed-source work in the nominal setting, and a model built on the Qwen3-VL backbone that yields a strong multi-task tabletop manipulation policy, outperforming the baseline by a wide margin.
What carries the argument
The shared training stack that handles end-to-end control from language pretraining to action-expert fine-tuning.
If this is right
- Fully open from-scratch models can reach performance levels comparable to previous closed-source work on tabletop manipulation tasks.
- Substituting a strong pretrained vision-language backbone produces multi-task policies that substantially outperform a baseline on the same simulator.
- Releasing the full codebase and model weights allows other researchers to train and evaluate their own VLA models without building separate pipelines for each stage.
- Usability improvements to the simulator and analysis tools make it simpler for the community to reproduce and extend the evaluations.
Where Pith is reading between the lines
- Other groups could adopt the same unified stack to test new backbones or tasks without rebuilding the entire training pipeline from scratch each time.
- Running the released models on physical robot hardware would provide a direct check on whether the simulator results translate to real environments.
- The framework's support for swapping components might enable faster iteration when newer vision-language models become available.
Load-bearing premise
Performance gains measured in the LBM Eval simulator reflect meaningful improvements in real-world robot capabilities, and the unified framework structure itself, rather than other implementation details, drives the reported results.
What would settle it
A side-by-side evaluation on the same LBM Eval tasks showing that the from-scratch model does not match prior closed-source success rates or that the Qwen3-VL model does not outperform the baseline by the reported margin.
Original abstract
We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
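The two training modes the abstract describes (a full LLM → VLM → VLA run versus action fine-tuning on a pretrained backbone) can be sketched as staged configurations that each initialize from the previous stage's checkpoint. This is an illustrative sketch only: `StageConfig`, `build_pipeline`, the dataset labels, and the checkpoint ids are assumptions for exposition, not the actual VLA Foundry API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a staged LLM -> VLM -> VLA pipeline.
# All names here are illustrative, not the real VLA Foundry interface.

@dataclass
class StageConfig:
    name: str                              # "llm", "vlm", or "vla"
    init_from: str                         # "scratch" or a checkpoint/backbone id
    datasets: list[str] = field(default_factory=list)

def build_pipeline(backbone: str = "scratch") -> list[StageConfig]:
    """Chain the stages so each one initializes from the previous checkpoint.

    Passing a pretrained VLM id (e.g. a Qwen3-VL checkpoint) lets the
    action-expert stage skip the earlier stages, mirroring the paper's
    two training modes.
    """
    if backbone != "scratch":
        # Backbone mode: only the action-expert stage is trained.
        return [StageConfig("vla", backbone, ["robot demonstrations"])]
    stages: list[StageConfig] = []
    prev = "scratch"
    for name, data in [
        ("llm", ["web text"]),
        ("vlm", ["image-text pairs"]),
        ("vla", ["robot demonstrations"]),
    ]:
        stages.append(StageConfig(name, prev, data))
        prev = f"ckpt/{name}"  # the next stage fine-tunes this stage's output
    return stages

scratch = build_pipeline()              # full LLM -> VLM -> VLA run
pretrained = build_pipeline("qwen3-vl") # action fine-tuning only
```

The point of the sketch is the chaining: in the from-scratch run each stage consumes the checkpoint the previous stage produced, whereas a pretrained backbone collapses the pipeline to a single action-training stage.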
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training within a single codebase supporting end-to-end control from language pretraining to action fine-tuning. It enables both fully from-scratch training via an LLM-to-VLM-to-VLA pipeline and integration of pretrained backbones such as Qwen3-VL. The authors train and release two models, evaluate closed-loop performance on the LBM Eval simulator for multi-task tabletop manipulation, contribute simulator usability improvements and STEP analysis tools, and claim that the from-scratch model matches prior closed-source work while the Qwen3-VL variant substantially outperforms the baseline. Code and weights are publicly released.
Significance. If the performance attribution holds, the work provides a reusable open engineering artifact that could reduce fragmentation in VLA development by replacing stitched pipelines with a shared stack. Explicit strengths include the public release of the full codebase, model weights, and simulator enhancements, which directly support reproducibility and community extension in robotics research.
Major comments (2)
- [Abstract and Experiments] The positioning of the shared training stack and end-to-end control as the key innovation is load-bearing, yet no ablation studies are presented that hold data volume, compute budget, and backbone fixed while varying only the unification aspect of the pipeline. This makes it impossible to credit the reported gains (from-scratch parity and Qwen3-VL outperformance) specifically to VLA Foundry rather than to implementation details or backbone strength.
- [Evaluation] All quantitative claims rest on closed-loop policy performance within the LBM Eval simulator alone; no real-robot transfer experiments or physical deployment results are included, so the practical significance for tabletop manipulation remains untested even if the simulator numbers are accurate.
Minor comments (2)
- The phrase 'nominal evaluation setting' appears in the abstract without a clear definition or pointer to the corresponding experimental protocol in the main text.
- Training hyperparameter tables or configuration files could be expanded to list exact data mixtures and optimizer settings used for the from-scratch versus backbone-substituted runs.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We are pleased that the open-source nature and potential for reducing fragmentation in VLA development are recognized as strengths. Below, we provide point-by-point responses to the major comments, indicating planned revisions where appropriate.
Point-by-point responses
- Referee [Abstract and Experiments]: The positioning of the shared training stack and end-to-end control as the key innovation is load-bearing, yet no ablation studies are presented that hold data volume, compute budget, and backbone fixed while varying only the unification aspect of the pipeline. This makes it impossible to credit the reported gains (from-scratch parity and Qwen3-VL outperformance) specifically to VLA Foundry rather than to implementation details or backbone strength.
Authors: We agree that an ablation isolating the unification aspect, while controlling for data, compute, and backbone, would more definitively attribute performance improvements to the shared stack. Our experiments instead demonstrate the framework's end-to-end capabilities by training competitive models from scratch and with pretrained backbones, achieving parity with prior closed-source results and outperforming baselines in the Qwen3-VL case. These results highlight the practical utility of VLA Foundry for unified training. To address the concern, we will revise the abstract and experiments section to scope the claims to the framework's enabling role, and we will add a limitations paragraph discussing the lack of such fine-grained ablations and our intent to pursue them in follow-up work. Revision: partial.
- Referee [Evaluation]: All quantitative claims rest on closed-loop policy performance within the LBM Eval simulator alone; no real-robot transfer experiments or physical deployment results are included, so the practical significance for tabletop manipulation remains untested even if the simulator numbers are accurate.
Authors: We concur that real-robot validation is essential for establishing practical significance beyond simulation. The manuscript's evaluations are deliberately scoped to the LBM Eval simulator to enable reproducible, large-scale multi-task assessment in a controlled environment, consistent with many prior VLA works. We will update the evaluation and conclusion sections to explicitly acknowledge this limitation and to highlight that the released codebase and models are designed to facilitate future real-world deployment and transfer studies. Revision: yes.
Circularity Check
No circularity: an engineering framework evaluated only against external benchmarks.
Full rationale
The paper introduces an open-source codebase for unified LLM/VLM/VLA training and reports empirical results on the external LBM Eval simulator. No mathematical derivations, fitted parameters renamed as predictions, self-definitional claims, or load-bearing self-citation chains appear in the provided text. Performance statements (on-par with prior closed work, Qwen3-VL variant outperforming baseline) are direct comparisons to external references rather than reductions to the paper's own inputs. The contribution is an artifact release plus simulator runs; the derivation chain is empty.