pith. machine review for the scientific record.

arxiv: 2605.07474 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

Dan Si, Jiancheng Lyu, Jian Lan, Jindi Lyu, Qing Ye, Thomas Seidl, Yang Zhou, Yuhao Zhou, Yunpeng Zhu, Zhangyuan Wang

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords federated learning · vision-language-action · robotics · pseudo-labeling · feature collapse · embodied AI · distributed training

The pith

ForgeVLA trains vision-language-action models across distributed robots using only vision-action pairs by locally recovering language labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models promise general robotic intelligence but are held back by the high cost of collecting language-annotated data. ForgeVLA sidesteps this by letting each robot run a local classifier that labels its own vision and action streams with instructions drawn from a fixed set, forming complete vision-language-action triplets. It further counters the tendency of vision and language features to lose distinctiveness during federated training by adding a contrastive planning loss on each client and an adaptive aggregation step on the server. The result is a scalable training process that never moves raw data and never requires manual labels.
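As a concreteness check, here is a minimal sketch of the on-device triplet construction described above, assuming a generic frozen classifier callable and an illustrative three-instruction set; the names, shapes, and PyTorch framing are this review's assumptions, not the paper's released code.

    import torch

    INSTRUCTIONS = ["pick up the red block", "open the drawer", "push the button"]

    @torch.no_grad()
    def build_triplets(classifier, vision_frames, actions):
        """Pseudo-label local vision-action logs with instructions from a fixed set."""
        logits = classifier(vision_frames, actions)   # (batch, num_instructions)
        labels = logits.argmax(dim=-1)                # hard pseudo-labels per pair
        return [(v, INSTRUCTIONS[i], a)               # (vision, language, action) triplet
                for v, a, i in zip(vision_frames, actions, labels.tolist())]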

Core claim

Each client uses an embodied instruction classifier to map raw vision-action pairs to a predefined instruction set, thereby constructing vision-language-action triplets without central data sharing; a client-side contrastive planning loss together with server-side adaptive aggregation then prevents vision-language feature collapse, allowing the federated model to learn task-discriminative representations that outperform baselines across multiple benchmarks.

What carries the argument

An embodied instruction classifier that recovers the language modality from vision-action pairs, combined with a contrastive planning loss and adaptive aggregation to preserve task-discriminative features.
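One plausible shape for such a contrastive planning loss is an InfoNCE-style objective over the fixed instruction embeddings; the sketch below is written under this review's assumptions and is not the authors' exact formula.

    import torch
    import torch.nn.functional as F

    def contrastive_planning_loss(va_feats, instr_embeds, labels, tau=0.07):
        # va_feats: (B, D) fused vision-action features from the local VLA encoder
        # instr_embeds: (K, D) embeddings of the K predefined instructions
        # labels: (B,) pseudo-label indices from the instruction classifier
        va = F.normalize(va_feats, dim=-1)
        txt = F.normalize(instr_embeds, dim=-1)
        logits = va @ txt.t() / tau        # temperature-scaled cosine similarities
        # pull each feature toward its own instruction, push away from the others
        return F.cross_entropy(logits, labels)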

If this is right

  • VLA models can be trained at larger scale by using the vision-action data that robots already collect without extra annotation effort.
  • Raw sensor data stays local, satisfying privacy and bandwidth constraints across different robot deployments.
  • Task-discriminative representations emerge even when clients hold data from dissimilar environments.
  • Ablation results confirm that removing either the local classifier or the contrastive-adaptive components degrades performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-modality-recovery pattern could be tested in other federated multimodal settings where one data type is missing.
  • Performance on robots whose instruction distributions differ sharply from the predefined set would reveal how much the fixed vocabulary constrains generality.
  • Replacing the fixed instruction set with a learned, expanding vocabulary on the server might further improve adaptability without centralizing raw data.

Load-bearing premise

That an embodied instruction classifier trained on a fixed set of instructions can recover language labels from vision-action pairs with enough accuracy and consistency to support effective VLA learning on heterogeneous clients.

What would settle it

Run the classifier on held-out vision-action sequences from new environments and measure whether its instruction-prediction accuracy correlates directly with the final performance of the trained VLA policy; a weak correlation would undermine the central claim.
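A sketch of that test, assuming the per-environment measurements are already collected; the numeric values are placeholders, not results from the paper.

    from scipy.stats import pearsonr

    # placeholder values for illustration only
    classifier_acc = [0.91, 0.74, 0.88, 0.62, 0.80]   # held-out instruction accuracy per environment
    policy_success = [0.67, 0.41, 0.63, 0.30, 0.55]   # final VLA success rate per environment

    r, p = pearsonr(classifier_acc, policy_success)
    print(f"Pearson r = {r:.2f}, p = {p:.3f}")        # a weak r would undermine the central claim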

Figures

Figures reproduced from arXiv: 2605.07474 by Dan Si, Jiancheng Lyu, Jian Lan, Jindi Lyu, Qing Ye, Thomas Seidl, Yang Zhou, Yuhao Zhou, Yunpeng Zhu, Zhangyuan Wang.

Figure 1. [Left]: The key bottleneck for scalable VLA training is data scarcity, as high-quality annotated VLA data are limited, while large volumes of vision–action logs remain underutilized. [Right]: ForgeVLA across N clients: ① Train an embodied instruction classifier on the central server; ② Clients download the pretrained classifier and the initialized global VLA model; ③ Perform on-device task classification t…
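To fix ideas, the round structure in Figure 1 can be paraphrased in a few lines, radically simplified to a linear model on synthetic features; every name and shape below is this review's assumption, not the released implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    K, D = 3, 8                                   # instructions, feature dim

    def pseudo_label(classifier_w, feats):
        """Step ③: the frozen instruction classifier assigns labels on-device."""
        return (feats @ classifier_w).argmax(axis=1)

    def local_update(global_w, feats, labels, lr=0.1):
        """Local VLA training stand-in: one softmax cross-entropy gradient step."""
        logits = feats @ global_w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(labels)), labels] -= 1.0
        return -lr * feats.T @ p / len(labels)    # only this update leaves the client

    classifier_w = rng.normal(size=(D, K))        # step ①: pretrained on the server
    clients = [rng.normal(size=(32, D)) for _ in range(4)]
    w = np.zeros((D, K))                          # step ②: initialized global VLA model
    for _ in range(10):                           # federated rounds
        updates = [local_update(w, X, pseudo_label(classifier_w, X)) for X in clients]
        w += np.mean(updates, axis=0)             # server aggregation (plain averaging here)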
Figure 2. The Instruction Classifier. On each client, cϕ classifies local vision–action logs into the predefined instruction set, forming a VLA training corpus while keeping all raw data on-device to preserve privacy. The client then performs local VLA training with a contrastive planning loss to counteract heterogeneity-induced degradation. After local training, clients upload only model updates, and the server app…
Figure 3. The illustrated distances and t-SNE [65] projections of latent feature representations learned by different models across different tasks. §4.2 Vision-Language Feature Collapse: Real-world robotic deployments exhibit pronounced cross-client heterogeneity. For example, manufacturing robots operating in different factories often share only a small subset of overlapping tasks, yielding highly skewed and largel…
Figure 4. Illustrations of Aggregations. Adaptive Aggregation Strategy: As shown in Figure 4 (a), simple averaging of heterogeneous client updates can cancel conflicting directions, producing a global update whose projection onto each client's update direction is substantially attenuated. This cancellation slows convergence and degrades performance. Based on this observation, we propose a server-side adaptive agg…
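The caption is cut off before the rule itself, so the following is only one plausible projection-aware aggregation in the spirit of Figure 4, not the authors' formula: clients whose update direction plain averaging would most attenuate receive larger weights.

    import numpy as np

    def adaptive_aggregate(client_updates, eps=1e-8):
        # client_updates: list of flattened per-client update vectors
        U = np.stack(client_updates)                      # (num_clients, dim)
        g = U.mean(axis=0)                                # plain FedAvg direction
        proj = U @ g / (np.linalg.norm(U, axis=1) + eps)  # projection of the mean onto each client direction
        w = 1.0 / (np.abs(proj) + eps)                    # upweight the most-cancelled clients
        w /= w.sum()
        return (w[:, None] * U).sum(axis=0)               # reweighted global update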
Figure 5. Real-world deployment of the trained ForgeVLA on the SO-ARM101 robotic platform.
Figure 6. The training loss curves of ForgeVLA and FedAvg.
Original abstract

Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ForgeVLA, a federated framework for training Vision-Language-Action (VLA) models from distributed vision-action pairs without centralizing raw data or requiring manual language annotations. Each client deploys an embodied instruction classifier that maps vision-action pairs onto a predefined instruction set to recover language modality and form VLA triplets. The framework additionally introduces a client-side contrastive planning loss and server-side adaptive aggregation to mitigate vision-language feature collapse. Extensive experiments across multiple benchmarks are reported to show significant outperformance over baselines, with ablation studies validating the contribution of each component.

Significance. If the empirical results hold and the method generalizes across heterogeneous client distributions, ForgeVLA could enable scalable VLA training by exploiting abundant unlabeled robot data in a privacy-preserving federated setting. This directly targets the annotation-cost and data-heterogeneity bottlenecks in embodied AI, with potential impact on general-purpose robotic intelligence. The identification of feature collapse as an overlooked issue in federated VLA is a useful conceptual contribution.

major comments (2)
  1. [Abstract and §3 (method description)] The training procedure, data requirements, and initialization for the embodied instruction classifier are unspecified. This is load-bearing for the central claim of learning 'without Language Annotations,' because any reliance on language-annotated data (even for pre-training or a seed set) would collapse the premise. The manuscript must clarify whether the classifier is obtained without annotations and how it handles client-specific visual/action shifts without retraining or additional labels.
  2. [§4 (Experiments) and Tables 1–2] The abstract asserts 'significant outperformance' and 'ablation studies further validate the contribution of each component,' yet the provided summary supplies no quantitative metrics, error bars, dataset sizes, or specific ablation numbers. Without these, the strength of the empirical support for the federated VLA claims cannot be assessed; the full manuscript must include them with clear baselines and statistical significance tests.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., success-rate deltas) rather than qualitative statements alone.
  2. [§3] Notation for the contrastive planning loss and adaptive aggregation should be introduced with explicit equations in §3 to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and §3 (method description)] The training procedure, data requirements, and initialization for the embodied instruction classifier are unspecified. This is load-bearing for the central claim of learning 'without Language Annotations,' because any reliance on language-annotated data (even for pre-training or a seed set) would collapse the premise. The manuscript must clarify whether the classifier is obtained without annotations and how it handles client-specific visual/action shifts without retraining or additional labels.

    Authors: We agree that the description of the embodied instruction classifier in §3 is too brief and will expand it substantially in the revision. The revised text will specify that the classifier is initialized from a publicly available pre-trained vision-language model and trained via cross-entropy loss on a fixed, publicly released set of vision-action-instruction triplets drawn from standard benchmarks. No client-specific language annotations are used at any stage. Client-specific visual and action distribution shifts are handled by keeping the classifier weights frozen after initialization; adaptation occurs exclusively through the client-side contrastive planning loss, which aligns the fixed language embeddings with local vision-action features without requiring retraining or new labels. revision: yes

  2. Referee: [§4 (Experiments) and Tables 1–2] The abstract asserts 'significant outperformance' and 'ablation studies further validate the contribution of each component,' yet the provided summary supplies no quantitative metrics, error bars, dataset sizes, or specific ablation numbers. Without these, the strength of the empirical support for the federated VLA claims cannot be assessed; the full manuscript must include them with clear baselines and statistical significance tests.

    Authors: The full manuscript already reports concrete metrics, ablation numbers, and baseline comparisons in Tables 1 and 2 together with dataset sizes. To address the concern directly, the revision will add error bars computed over multiple random seeds, explicitly list the number of runs, and include statistical significance tests (paired t-tests with p-values) for the primary performance gains over baselines. revision: yes
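For concreteness, the promised paired test could look like the sketch below, with placeholder per-seed success rates standing in for the values in Tables 1 and 2.

    from scipy.stats import ttest_rel

    # placeholder per-seed success rates, not values from the paper
    forgevla = [0.71, 0.68, 0.73, 0.70, 0.69]
    fedavg   = [0.55, 0.58, 0.54, 0.57, 0.56]

    t_stat, p_value = ttest_rel(forgevla, fedavg)   # paired t-test across matched seeds
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")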

Circularity Check

0 steps flagged

No circularity detected; framework is engineering-based with independent empirical validation.

Full rationale

The paper describes ForgeVLA as a federated framework that equips clients with an embodied instruction classifier to map vision-action pairs onto a predefined instruction set, thereby forming VLA triplets without centralizing data or manual annotations. It further proposes a client-side contrastive planning loss and server-side adaptive aggregation to address vision-language feature collapse. No equations, derivations, or first-principles predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on the design of these components and their experimental performance across benchmarks, which constitutes independent content rather than tautological equivalence to the inputs. The classifier's training procedure is not specified in the abstract, but this is an unelaborated assumption, not a circular reduction in any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or independent evidence for new entities; the classifier and losses are described at a conceptual level only.

invented entities (1)
  • embodied instruction classifier · no independent evidence
    purpose: Maps vision-action pairs to a predefined instruction set to recover the language modality
    Introduced as a core client-side component to form complete VLA triplets without manual annotations

pith-pipeline@v0.9.0 · 5550 in / 1330 out tokens · 61553 ms · 2026-05-11T01:47:39.954117+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 7 internal anchors

  1. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.
  2. Milad Abolhasani and Eugenia Kumacheva. The rise of self-driving labs in chemical and materials sciences. Nature Synthesis, 2(6):483–492, 2023. doi: 10.1038/s44160-022-00231-0.
  3. Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021.
  4. Amazon. Amazon launches a new AI foundation model to power its robotic fleet and deploys its 1 millionth robot, 2025. URL https://www.aboutamazon.com/news/operations/amazon-million-robots-ai-foundation-model.
  5. Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015. doi: 10.1287/opre.2015.1408.
  6. Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. In International Conference on Learning Representations (ICLR), 2024.
  7. Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. In Robotics: Science and Systems (RSS), 2025.
  8. Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In ACM SIGSAC Conference on Computer and Communications Security, 2017.
  9. Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.
  10. Santiago Bou Betran, Alberta Longhini, Miguel Vasco, Yuchong Zhang, and Danica Kragic. FLAME: A federated learning benchmark for robotic manipulation. arXiv preprint arXiv:2503.01729, 2025.
  11. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
  12. Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, et al. InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025.
  13. Miao Cui, Tao Chang, Meihan Wu, Hongbin Xu, Chun Li, Ming Li, and Xiaodong Wang. FedVLA: Federated vision-language-action learning with dual gating mixture-of-experts for robotic manipulation. In International Conference on Computer Vision (ICCV), 2025.
  14. Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning (CoRL), 2019.
  15. Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 496–512. PMLR, 2024.
  16. Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. In International Conference on Machine Learning (ICML), 2023.
  17. Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023.
  18. Tom Duckett, Simon Pearson, Simon Blackmore, and Bruce Grieve. Agricultural robotics: The future of robotic agriculture. UK-RAS white paper, EPSRC UK-RAS Network, 2018.
  19. Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. Inverting gradients – how easy is it to break privacy in federated learning? In Advances in Neural Information Processing Systems, 2020.
  20. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  21. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  22. International Federation of Robotics. World robotics 2025 – industrial robots. Technical report, IFR Statistical Department, Frankfurt, 2025. 4,664,000 industrial robots in operational use worldwide in 2024. Available at https://ifr.org/ifr-press-releases/news/global-robot-demand-in-factories-doubles-over-10-years.
  23. Intuitive Surgical, Inc. 2025 annual report, 2025. URL https://isrg.intuitive.com/static-files/d01bbc25-f8cf-433b-8ebb-b5afc1926236.
  24. Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020.
  25. Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2), 2021.
  26. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  27. Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 2020.
  28. Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on Automation Science and Engineering, 12(2):398–409, 2015.
  29. Ahmed Khaled, Konstantin Mishchenko, and Peter Richtarik. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 4519–4529. PMLR, 2020. URL https://proceedings.mlr.press/v108/bayoumi20a.html.
  30. Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
  31. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
  32. Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
  33. Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023.
  34. Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations (ICLR), 2021.
  35. Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 19884–19895, 2020.
  36. Gihun Lee, Minchan Jeong, Yongjin Shin, Sangmin Bae, and Se-Young Yun. Preservation of the global knowledge by not-true distillation in federated learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  37. Hongxia Li, Wei Huang, Jingya Wang, and Ye Shi. Global and local prompts cooperation via optimal transport for federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12151–12161, 2024.
  38. Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10713–10722, 2021.
  39. Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:24…
  40. Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Machine Learning and Systems (MLSys), 2020.
  41. Sen Lin, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness Shroff. Non-convex bilevel optimization with time-varying objective functions. Advances in Neural Information Processing Systems, 36:29692–29717, 2023.
  42. Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
  43. Boyi Liu, Lujia Wang, Ming Liu, and Cheng-Zhong Xu. Federated imitation learning: A novel framework for cloud robotic systems with heterogeneous sensor data. IEEE Robotics and Automation Letters, 2020.
  44. Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 783–791. PMLR, 2013.
  45. Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning (CoRL), 2023.
  46. H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  47. Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.
  48. Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  49. Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and…
  50. Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. IEEE International Conference on Robotics and Automation (ICRA), 2024.
  51. Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
  52. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  53. Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013. doi: 10.1137/120891009.
  54. Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=LkFG3lB13U5.
  55. Desik Rengarajan, Nitin Ragothaman, Dileep Kalathil, and Srinivas Shakkottai. FEDORA: Federated ensemble-directed offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  56. Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. In Robotics: Science and Systems (RSS), 2024.
  57. Seonguk Seo, Jinkyu Kim, Geeho Kim, and Bohyung Han. Relaxed contrastive learning for federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  58. Mingjia Shi, Yuhao Zhou, Kai Wang, Huaizheng Zhang, Shudong Huang, Qing Ye, and Jiancheng Lv. Prior: Personalized prior for reactivating the information overlooked in federated learning. Advances in Neural Information Processing Systems, 36:28378–28392, 2023.
  59. Yujun Shi, Jian Liang, Wenqing Zhang, Vincent Y. F. Tan, and Song Bai. Towards understanding and mitigating dimensional collapse in heterogeneous federated learning. In International Conference on Learning Representations (ICLR), 2023.
  60. Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.
  61. Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
  62. Yue Tan, Guodong Long, Lu Liu, Tianyi Zhou, Qinghua Lu, Jing Jiang, and Chengqi Zhang. FedProto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8432–8440, 2022.
  63. Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
  64. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  65. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  66. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325, 2022.
  67. Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.
  68. Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems, volume 33, 2020.
  69. Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems, 2024.
  70. Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025.
  71. Russell B. Wynn, Veerle A.I. Huvenne, Timothy P. Le Bas, Bramley J. Murton, Douglas P. Connelly, Brian J. Bett, Henry A. Ruhl, Kirsty J. Morris, Jeffrey Peakall, Daniel R. Parsons, Esther J. Sumner, Stephen E. Darby, Robert M. Dorrell, and James E. Hunt. Autonomous underwater vehicles (AUVs): Their past, present and future contributions to the advancement…
  72. Liping Yi, Han Yu, Chao Ren, Gang Wang, Xiaoguang Liu, and Xiaoxiao Li. Federated model heterogeneous matryoshka representation learning. Advances in Neural Information Processing Systems, 37:66431–66454, 2024.
  73. Qiying Yu, Yang Liu, Yimu Wang, Ke Xu, and Jingjing Liu. Multimodal federated learning via contrastive representation ensemble. In International Conference on Learning Representations (ICLR), 2023.
  74. Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Jodilyn Peralta, Brian Ichter, et al. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023.
  75. Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
  76. Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. EgoScale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026.
  77. Kaiwen Zhou and Xin Eric Wang. FedVLN: Privacy-preserving federated vision-and-language navigation. In European Conference on Computer Vision (ECCV), 2022.
  78. Yuhao Zhou, Qing Ye, and Jiancheng Lv. Communication-efficient federated learning with compensated overlap-FedAvg. IEEE Transactions on Parallel and Distributed Systems, 33(1):192–205, 2021.
  79. Yuhao Zhou, Mingjia Shi, Yuanxi Li, Yanan Sun, Qing Ye, and Jiancheng Lv. Communication-efficient federated learning with single-step synthetic features compressor for faster convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5008–5017, 2023.

Showing first 79 references.