pith. machine review for the scientific record.

arxiv: 2605.07474 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

Dan Si, Jiancheng Lyu, Jian Lan, Jindi Lyu, Qing Ye, Thomas Seidl, Yang Zhou, Yuhao Zhou, Yunpeng Zhu, Zhangyuan Wang

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords federated learning · vision-language-action · robotics · pseudo-labeling · feature collapse · embodied AI · distributed training

The pith

ForgeVLA trains vision-language-action models across distributed robots using only vision-action pairs by locally recovering language labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models promise general robotic intelligence but are held back by the high cost of collecting language-annotated data. ForgeVLA sidesteps this by letting each robot run a local classifier that labels its own vision and action streams with instructions drawn from a fixed set, forming complete vision-language-action triplets. It further counters the tendency of vision and language features to lose distinctiveness during federated training by adding a contrastive planning loss on each client and an adaptive aggregation step on the server. The result is a scalable training process that never moves raw data and never requires manual labels.
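As a concreteness check, here is a minimal sketch of the on-device triplet construction described above, assuming a generic frozen classifier callable and an illustrative three-instruction set; the names, shapes, and PyTorch framing are this review's assumptions, not the paper's released code.

    import torch

    INSTRUCTIONS = ["pick up the red block", "open the drawer", "push the button"]

    @torch.no_grad()
    def build_triplets(classifier, vision_frames, actions):
        """Pseudo-label local vision-action logs with instructions from a fixed set."""
        logits = classifier(vision_frames, actions)   # (batch, num_instructions)
        labels = logits.argmax(dim=-1)                # hard pseudo-labels per pair
        return [(v, INSTRUCTIONS[i], a)               # (vision, language, action) triplet
                for v, a, i in zip(vision_frames, actions, labels.tolist())]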

Core claim

Each client uses an embodied instruction classifier to map raw vision-action pairs to a predefined instruction set, thereby constructing vision-language-action triplets without central data sharing; a client-side contrastive planning loss together with server-side adaptive aggregation then prevents vision-language feature collapse, allowing the federated model to learn task-discriminative representations that outperform baselines across multiple benchmarks.

What carries the argument

An embodied instruction classifier that recovers the language modality from vision-action pairs, combined with a contrastive planning loss and adaptive aggregation to preserve task-discriminative features.
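One plausible shape for such a contrastive planning loss is an InfoNCE-style objective over the fixed instruction embeddings; the sketch below is written under this review's assumptions and is not the authors' exact formula.

    import torch
    import torch.nn.functional as F

    def contrastive_planning_loss(va_feats, instr_embeds, labels, tau=0.07):
        # va_feats: (B, D) fused vision-action features from the local VLA encoder
        # instr_embeds: (K, D) embeddings of the K predefined instructions
        # labels: (B,) pseudo-label indices from the instruction classifier
        va = F.normalize(va_feats, dim=-1)
        txt = F.normalize(instr_embeds, dim=-1)
        logits = va @ txt.t() / tau        # temperature-scaled cosine similarities
        # pull each feature toward its own instruction, push away from the others
        return F.cross_entropy(logits, labels)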

If this is right

  • VLA models can be trained at larger scale by using the vision-action data that robots already collect without extra annotation effort.
  • Raw sensor data stays local, satisfying privacy and bandwidth constraints across different robot deployments.
  • Task-discriminative representations emerge even when clients hold data from dissimilar environments.
  • Ablation results confirm that removing either the local classifier or the contrastive-adaptive components degrades performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-modality-recovery pattern could be tested in other federated multimodal settings where one data type is missing.
  • Performance on robots whose instruction distributions differ sharply from the predefined set would reveal how much the fixed vocabulary constrains generality.
  • Replacing the fixed instruction set with a learned, expanding vocabulary on the server might further improve adaptability without centralizing raw data.

Load-bearing premise

That an embodied instruction classifier trained on a fixed set of instructions can recover language labels from vision-action pairs with enough accuracy and consistency to support effective VLA learning on heterogeneous clients.

What would settle it

Run the classifier on held-out vision-action sequences from new environments and measure whether its instruction-prediction accuracy correlates directly with the final performance of the trained VLA policy; a weak correlation would undermine the central claim.
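A sketch of that test, assuming the per-environment measurements are already collected; the numeric values are placeholders, not results from the paper.

    from scipy.stats import pearsonr

    # placeholder values for illustration only
    classifier_acc = [0.91, 0.74, 0.88, 0.62, 0.80]   # held-out instruction accuracy per environment
    policy_success = [0.67, 0.41, 0.63, 0.30, 0.55]   # final VLA success rate per environment

    r, p = pearsonr(classifier_acc, policy_success)
    print(f"Pearson r = {r:.2f}, p = {p:.3f}")        # a weak r would undermine the central claim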

Figures

Figures reproduced from arXiv: 2605.07474 by Dan Si, Jiancheng Lyu, Jian Lan, Jindi Lyu, Qing Ye, Thomas Seidl, Yang Zhou, Yuhao Zhou, Yunpeng Zhu, Zhangyuan Wang.

Figure 1. [Left]: The key bottleneck for scalable VLA training is data scarcity, as high-quality annotated VLA data are limited, while large volumes of vision–action logs remain underutilized. [Right]: ForgeVLA across N clients: ① Train an embodied instruction classifier on the central server; ② Clients download the pretrained classifier and the initialized global VLA model; ③ Perform on-device task classification t…
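To fix ideas, the round structure in Figure 1 can be paraphrased in a few lines, radically simplified to a linear model on synthetic features; every name and shape below is this review's assumption, not the released implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    K, D = 3, 8                                   # instructions, feature dim

    def pseudo_label(classifier_w, feats):
        """Step ③: the frozen instruction classifier assigns labels on-device."""
        return (feats @ classifier_w).argmax(axis=1)

    def local_update(global_w, feats, labels, lr=0.1):
        """Local VLA training stand-in: one softmax cross-entropy gradient step."""
        logits = feats @ global_w
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(labels)), labels] -= 1.0
        return -lr * feats.T @ p / len(labels)    # only this update leaves the client

    classifier_w = rng.normal(size=(D, K))        # step ①: pretrained on the server
    clients = [rng.normal(size=(32, D)) for _ in range(4)]
    w = np.zeros((D, K))                          # step ②: initialized global VLA model
    for _ in range(10):                           # federated rounds
        updates = [local_update(w, X, pseudo_label(classifier_w, X)) for X in clients]
        w += np.mean(updates, axis=0)             # server aggregation (plain averaging here)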
Figure 2. The Instruction Classifier. On each client, cϕ classifies local vision–action logs into the predefined instruction set, forming a VLA training corpus while keeping all raw data on-device to preserve privacy. The client then performs local VLA training with a contrastive planning loss to counteract heterogeneity-induced degradation. After local training, clients upload only model updates, and the server app…
Figure 3. The illustrated distances and t-SNE [65] projections of latent feature representations learned by different models across different tasks. §4.2 Vision-Language Feature Collapse: Real-world robotic deployments exhibit pronounced cross-client heterogeneity. For example, manufacturing robots operating in different factories often share only a small subset of overlapping tasks, yielding highly skewed and largel…
Figure 4. Illustrations of Aggregations. Adaptive Aggregation Strategy: As shown in Figure 4 (a), simple averaging of heterogeneous client updates can cancel conflicting directions, producing a global update whose projection onto each client's update direction is substantially attenuated. This cancellation slows convergence and degrades performance. Based on this observation, we propose a server-side adaptive agg…
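The caption is cut off before the rule itself, so the following is only one plausible projection-aware aggregation in the spirit of Figure 4, not the authors' formula: clients whose update direction plain averaging would most attenuate receive larger weights.

    import numpy as np

    def adaptive_aggregate(client_updates, eps=1e-8):
        # client_updates: list of flattened per-client update vectors
        U = np.stack(client_updates)                      # (num_clients, dim)
        g = U.mean(axis=0)                                # plain FedAvg direction
        proj = U @ g / (np.linalg.norm(U, axis=1) + eps)  # projection of the mean onto each client direction
        w = 1.0 / (np.abs(proj) + eps)                    # upweight the most-cancelled clients
        w /= w.sum()
        return (w[:, None] * U).sum(axis=0)               # reweighted global update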
Figure 5. Real-world deployment of the trained ForgeVLA on the SO-ARM101 robotic platform.
Figure 6. The training loss curves of ForgeVLA and FedAvg.
Original abstract

Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ForgeVLA, a federated framework for training Vision-Language-Action (VLA) models from distributed vision-action pairs without centralizing raw data or requiring manual language annotations. Each client deploys an embodied instruction classifier that maps vision-action pairs onto a predefined instruction set to recover language modality and form VLA triplets. The framework additionally introduces a client-side contrastive planning loss and server-side adaptive aggregation to mitigate vision-language feature collapse. Extensive experiments across multiple benchmarks are reported to show significant outperformance over baselines, with ablation studies validating the contribution of each component.

Significance. If the empirical results hold and the method generalizes across heterogeneous client distributions, ForgeVLA could enable scalable VLA training by exploiting abundant unlabeled robot data in a privacy-preserving federated setting. This directly targets the annotation-cost and data-heterogeneity bottlenecks in embodied AI, with potential impact on general-purpose robotic intelligence. The identification of feature collapse as an overlooked issue in federated VLA is a useful conceptual contribution.

major comments (2)
  1. [Abstract and §3 (method description)] The training procedure, data requirements, and initialization for the embodied instruction classifier are unspecified. This is load-bearing for the central claim of learning 'without Language Annotations,' because any reliance on language-annotated data (even for pre-training or a seed set) would collapse the premise. The manuscript must clarify whether the classifier is obtained without annotations and how it handles client-specific visual/action shifts without retraining or additional labels.
  2. [§4 (Experiments) and Tables 1–2] The abstract asserts 'significant outperformance' and 'ablation studies further validate the contribution of each component,' yet the provided summary supplies no quantitative metrics, error bars, dataset sizes, or specific ablation numbers. Without these, the strength of the empirical support for the federated VLA claims cannot be assessed; the full manuscript must include them with clear baselines and statistical significance tests.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., success-rate deltas) rather than qualitative statements alone.
  2. [§3] Notation for the contrastive planning loss and adaptive aggregation should be introduced with explicit equations in §3 to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and §3 (method description)] The training procedure, data requirements, and initialization for the embodied instruction classifier are unspecified. This is load-bearing for the central claim of learning 'without Language Annotations,' because any reliance on language-annotated data (even for pre-training or a seed set) would collapse the premise. The manuscript must clarify whether the classifier is obtained without annotations and how it handles client-specific visual/action shifts without retraining or additional labels.

    Authors: We agree that the description of the embodied instruction classifier in §3 is too brief and will expand it substantially in the revision. The revised text will specify that the classifier is initialized from a publicly available pre-trained vision-language model and trained via cross-entropy loss on a fixed, publicly released set of vision-action-instruction triplets drawn from standard benchmarks. No client-specific language annotations are used at any stage. Client-specific visual and action distribution shifts are handled by keeping the classifier weights frozen after initialization; adaptation occurs exclusively through the client-side contrastive planning loss, which aligns the fixed language embeddings with local vision-action features without requiring retraining or new labels. revision: yes

  2. Referee: [§4 (Experiments) and Tables 1–2] The abstract asserts 'significant outperformance' and 'ablation studies further validate the contribution of each component,' yet the provided summary supplies no quantitative metrics, error bars, dataset sizes, or specific ablation numbers. Without these, the strength of the empirical support for the federated VLA claims cannot be assessed; the full manuscript must include them with clear baselines and statistical significance tests.

    Authors: The full manuscript already reports concrete metrics, ablation numbers, and baseline comparisons in Tables 1 and 2 together with dataset sizes. To address the concern directly, the revision will add error bars computed over multiple random seeds, explicitly list the number of runs, and include statistical significance tests (paired t-tests with p-values) for the primary performance gains over baselines. revision: yes
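For concreteness, the promised paired test could look like the sketch below, with placeholder per-seed success rates standing in for the values in Tables 1 and 2.

    from scipy.stats import ttest_rel

    # placeholder per-seed success rates, not values from the paper
    forgevla = [0.71, 0.68, 0.73, 0.70, 0.69]
    fedavg   = [0.55, 0.58, 0.54, 0.57, 0.56]

    t_stat, p_value = ttest_rel(forgevla, fedavg)   # paired t-test across matched seeds
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")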

Circularity Check

0 steps flagged

No circularity detected; framework is engineering-based with independent empirical validation.

Full rationale

The paper describes ForgeVLA as a federated framework that equips clients with an embodied instruction classifier to map vision-action pairs onto a predefined instruction set, thereby forming VLA triplets without centralizing data or manual annotations. It further proposes a client-side contrastive planning loss and server-side adaptive aggregation to address vision-language feature collapse. No equations, derivations, or first-principles predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on the design of these components and their experimental performance across benchmarks, which constitutes independent content rather than tautological equivalence to the inputs. The classifier's training procedure is not specified in the abstract, but this is an unelaborated assumption, not a circular reduction in any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review provides no explicit free parameters, axioms, or independent evidence for new entities; the classifier and losses are described at a conceptual level only.

invented entities (1)
  • embodied instruction classifier · no independent evidence
    purpose: Maps vision-action pairs to a predefined instruction set to recover the language modality
    Introduced as a core client-side component to form complete VLA triplets without manual annotations

pith-pipeline@v0.9.0 · 5550 in / 1330 out tokens · 61553 ms · 2026-05-11T01:47:39.954117+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 7 internal anchors

  1. Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.
  2. Milad Abolhasani and Eugenia Kumacheva. The rise of self-driving labs in chemical and materials sciences. Nature Synthesis, 2(6):483–492, 2023. doi: 10.1038/s44160-022-00231-0.
  3. Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021.
  4. Amazon. Amazon launches a new AI foundation model to power its robotic fleet and deploys its 1 millionth robot, 2025. URL https://www.aboutamazon.com/news/operations/amazon-million-robots-ai-foundation-model.
  5. Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015. doi: 10.1287/opre.2015.1408.
  6. Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. In International Conference on Learning Representations (ICLR), 2024.
  7. Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. In Robotics: Science and Systems (RSS), 2025.
  8. Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In ACM SIGSAC Conference on Computer and Communications Security, 2017.
  9. Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.
  10. Santiago Bou Betran, Alberta Longhini, Miguel Vasco, Yuchong Zhang, and Danica Kragic. FLAME: A federated learning benchmark for robotic manipulation. arXiv preprint arXiv:2503.01729, 2025.
  11. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
  12. Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, et al. InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025.
  13. Miao Cui, Tao Chang, Meihan Wu, Hongbin Xu, Chun Li, Ming Li, and Xiaodong Wang. FedVLA: Federated vision-language-action learning with dual gating mixture-of-experts for robotic manipulation. In International Conference on Computer Vision (ICCV), 2025.
  14. Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning (CoRL), 2019.
  15. Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 496–512. PMLR, 2024.
  16. Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. In International Conference on Machine Learning (ICML), 2023.
  17. Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023.
  18. Tom Duckett, Simon Pearson, Simon Blackmore, and Bruce Grieve. Agricultural robotics: The future of robotic agriculture. UK-RAS white paper, EPSRC UK-RAS Network, 2018.
  19. Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. Inverting gradients – how easy is it to break privacy in federated learning? In Advances in Neural Information Processing Systems, 2020.
  20. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  21. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  22. International Federation of Robotics. World robotics 2025 – industrial robots. Technical report, IFR Statistical Department, Frankfurt, 2025. 4,664,000 industrial robots in operational use worldwide in 2024. Available at https://ifr.org/ifr-press-releases/news/global-robot-demand-in-factories-doubles-over-10-years.
  23. Intuitive Surgical, Inc. 2025 annual report, 2025. URL https://isrg.intuitive.com/static-files/d01bbc25-f8cf-433b-8ebb-b5afc1926236.
  24. Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 2020.
  25. Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1–2), 2021.
  26. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  27. Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 2020.
  28. Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on Automation Science and Engineering, 12(2):398–409, 2015.
  29. Ahmed Khaled, Konstantin Mishchenko, and Peter Richtarik. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 4519–4529. PMLR, 2020. URL https://proceedings.mlr.press/v108/bayoumi20a.html.
  30. Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
  31. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
  32. Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
  33. Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023.
  34. Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations (ICLR), 2021.
  35. Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 19884–19895, 2020.
  36. Gihun Lee, Minchan Jeong, Yongjin Shin, Sangmin Bae, and Se-Young Yun. Preservation of the global knowledge by not-true distillation in federated learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  37. Hongxia Li, Wei Huang, Jingya Wang, and Ye Shi. Global and local prompts cooperation via optimal transport for federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12151–12161, 2024.
  38. Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10713–10722, 2021.
  39. Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:24…
  40. Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Machine Learning and Systems (MLSys), 2020.
  41. Sen Lin, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness Shroff. Non-convex bilevel optimization with time-varying objective functions. Advances in Neural Information Processing Systems, 36:29692–29717, 2023.
  42. Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
  43. Boyi Liu, Lujia Wang, Ming Liu, and Cheng-Zhong Xu. Federated imitation learning: A novel framework for cloud robotic systems with heterogeneous sensor data. IEEE Robotics and Automation Letters, 2020.
  44. Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 783–791. PMLR, 2013.
  45. Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning (CoRL), 2023.
  46. H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  47. Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, and Hao Su. ManiSkill: Generalizable manipulation skill benchmark with large-scale demonstrations. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021.
  48. Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  49. Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and…
  50. Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. IEEE International Conference on Robotics and Automation (ICRA), 2024.
  51. Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
  52. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  53. Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013. doi: 10.1137/120891009.
  54. Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=LkFG3lB13U5.
  55. Desik Rengarajan, Nitin Ragothaman, Dileep Kalathil, and Srinivas Shakkottai. FEDORA: Federated ensemble-directed offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  56. Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. In Robotics: Science and Systems (RSS), 2024.
  57. Seonguk Seo, Jinkyu Kim, Geeho Kim, and Bohyung Han. Relaxed contrastive learning for federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  58. Mingjia Shi, Yuhao Zhou, Kai Wang, Huaizheng Zhang, Shudong Huang, Qing Ye, and Jiancheng Lv. Prior: Personalized prior for reactivating the information overlooked in federated learning. Advances in Neural Information Processing Systems, 36:28378–28392, 2023.
  59. Yujun Shi, Jian Liang, Wenqing Zhang, Vincent Y. F. Tan, and Song Bai. Towards understanding and mitigating dimensional collapse in heterogeneous federated learning. In International Conference on Learning Representations (ICLR), 2023.
  60. Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.
  61. Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2446–2454, 2020.
  62. Yue Tan, Guodong Long, Lu Liu, Tianyi Zhou, Qinghua Lu, Jing Jiang, and Chengqi Zhang. FedProto: Federated prototype learning across heterogeneous clients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8432–8440, 2022.
  63. Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
  64. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  65. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  66. Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325, 2022.
  67. Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.
  68. Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems, volume 33, 2020.
  69. Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems, 2024.
  70. Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025.
  71. Russell B. Wynn, Veerle A.I. Huvenne, Timothy P. Le Bas, Bramley J. Murton, Douglas P. Connelly, Brian J. Bett, Henry A. Ruhl, Kirsty J. Morris, Jeffrey Peakall, Daniel R. Parsons, Esther J. Sumner, Stephen E. Darby, Robert M. Dorrell, and James E. Hunt. Autonomous underwater vehicles (AUVs): Their past, present and future contributions to the advancement…
  72. Liping Yi, Han Yu, Chao Ren, Gang Wang, Xiaoguang Liu, and Xiaoxiao Li. Federated model heterogeneous matryoshka representation learning. Advances in Neural Information Processing Systems, 37:66431–66454, 2024.
  73. Qiying Yu, Yang Liu, Yimu Wang, Ke Xu, and Jingjing Liu. Multimodal federated learning via contrastive representation ensemble. In International Conference on Learning Representations (ICLR), 2023.
  74. Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Jodilyn Peralta, Brian Ichter, et al. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023.
  75. Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
  76. Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. EgoScale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026.
  77. Kaiwen Zhou and Xin Eric Wang. FedVLN: Privacy-preserving federated vision-and-language navigation. In European Conference on Computer Vision (ECCV), 2022.
  78. Yuhao Zhou, Qing Ye, and Jiancheng Lv. Communication-efficient federated learning with compensated overlap-FedAvg. IEEE Transactions on Parallel and Distributed Systems, 33(1):192–205, 2021.
  79. Yuhao Zhou, Mingjia Shi, Yuanxi Li, Yanan Sun, Qing Ye, and Jiancheng Lv. Communication-efficient federated learning with single-step synthetic features compressor for faster convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5008–5017, 2023.

Showing first 79 references.