ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3
The pith
ForgeVLA trains vision-language-action models across distributed robots using only vision-action pairs by locally recovering language labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Each client uses an embodied instruction classifier to map raw vision-action pairs to a predefined instruction set, thereby constructing vision-language-action triplets without central data sharing; a client-side contrastive planning loss together with server-side adaptive aggregation then prevents vision-language feature collapse, allowing the federated model to learn task-discriminative representations that outperform baselines across multiple benchmarks.
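The round structure implied by this claim can be sketched as follows. This is a minimal illustration of the pattern (a frozen local classifier recovers instructions, clients train on the resulting triplets, the server aggregates); all names are hypothetical, and the plain size-weighted average stands in for the paper's adaptive aggregation strategy.

```python
# Hypothetical sketch of one ForgeVLA-style federated round.
# Names and structure are assumptions, not the paper's exact algorithm.

def label_pairs(classifier, pairs, instruction_set):
    """Recover the missing language modality: map each vision-action
    pair to the most likely predefined instruction."""
    triplets = []
    for vision, action in pairs:
        scores = classifier(vision, action)          # one score per instruction
        best = max(range(len(instruction_set)), key=scores.__getitem__)
        triplets.append((vision, instruction_set[best], action))
    return triplets

def federated_round(global_weights, clients, classifier, instruction_set):
    """One communication round: local labeling + training, then
    server-side aggregation (size-weighted average as a stand-in
    for the paper's adaptive aggregation)."""
    updates, sizes = [], []
    for client in clients:
        triplets = label_pairs(classifier, client["pairs"], instruction_set)
        # Local training (action-prediction loss + contrastive planning
        # loss in the paper) is stubbed as a client-supplied function.
        updates.append(client["train"](global_weights, triplets))
        sizes.append(len(triplets))
    total = sum(sizes)
    return [sum(w[i] * s / total for w, s in zip(updates, sizes))
            for i in range(len(global_weights))]
```

Raw sensor data never leaves the client in this loop; only the weight updates reach the server.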
What carries the argument
An embodied instruction classifier that recovers the language modality from vision-action pairs, combined with a contrastive planning loss and adaptive aggregation to preserve task-discriminative features.
If this is right
- VLA models can be trained at larger scale by using the vision-action data that robots already collect, with no extra annotation effort.
- Raw sensor data stays local, satisfying privacy and bandwidth constraints across different robot deployments.
- Task-discriminative representations emerge even when clients hold data from dissimilar environments.
- Ablation results confirm that removing either the local classifier or the contrastive-adaptive components degrades performance.
Where Pith is reading between the lines
- The same local-modality-recovery pattern could be tested in other federated multimodal settings where one data type is missing.
- Performance on robots whose instruction distributions differ sharply from the predefined set would reveal how much the fixed vocabulary constrains generality.
- Replacing the fixed instruction set with a learned, expanding vocabulary on the server might further improve adaptability without centralizing raw data.
Load-bearing premise
That an embodied instruction classifier trained on a fixed set of instructions can recover language labels from vision-action pairs with enough accuracy and consistency to support effective VLA learning on heterogeneous clients.
What would settle it
Run the classifier on held-out vision-action sequences from new environments and measure whether its instruction-prediction accuracy correlates directly with the final performance of the trained VLA policy; a weak correlation would undermine the central claim.
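The proposed test reduces to a correlation check. A minimal sketch, assuming per-environment accuracy and success-rate numbers are available (the figures below are illustrative placeholders, not results from the paper):

```python
# Correlate instruction-classifier accuracy with final VLA policy
# success across held-out environments. A strong positive r supports
# the load-bearing premise; r near zero would undermine it.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical held-out environments:
classifier_acc = [0.55, 0.68, 0.74, 0.81, 0.90]   # instruction accuracy
policy_success = [0.30, 0.41, 0.48, 0.57, 0.66]   # task success rate
r = pearson(classifier_acc, policy_success)
```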
original abstract
Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ForgeVLA, a federated framework for training Vision-Language-Action (VLA) models from distributed vision-action pairs without centralizing raw data or requiring manual language annotations. Each client deploys an embodied instruction classifier that maps vision-action pairs onto a predefined instruction set to recover language modality and form VLA triplets. The framework additionally introduces a client-side contrastive planning loss and server-side adaptive aggregation to mitigate vision-language feature collapse. Extensive experiments across multiple benchmarks are reported to show significant outperformance over baselines, with ablation studies validating the contribution of each component.
Significance. If the empirical results hold and the method generalizes across heterogeneous client distributions, ForgeVLA could enable scalable VLA training by exploiting abundant unlabeled robot data in a privacy-preserving federated setting. This directly targets the annotation-cost and data-heterogeneity bottlenecks in embodied AI, with potential impact on general-purpose robotic intelligence. The identification of feature collapse as an overlooked issue in federated VLA is a useful conceptual contribution.
major comments (2)
- [Abstract and §3] Abstract and §3 (method description): The training procedure, data requirements, and initialization for the embodied instruction classifier are unspecified. This is load-bearing for the central claim of learning 'without Language Annotations,' because any reliance on language-annotated data (even for pre-training or a seed set) would collapse the premise. The manuscript must clarify whether the classifier is obtained without annotations and how it handles client-specific visual/action shifts without retraining or additional labels.
- [§4 (Experiments) and Table 1/2] §4 (Experiments) and Table 1/2: The abstract asserts 'significant outperformance' and 'ablation studies further validate the contribution of each component,' yet the provided summary supplies no quantitative metrics, error bars, dataset sizes, or specific ablation numbers. Without these, the strength of the empirical support for the federated VLA claims cannot be assessed; the full manuscript must include them with clear baselines and statistical significance tests.
minor comments (2)
- [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., success-rate deltas) rather than qualitative statements alone.
- [§3] Notation for the contrastive planning loss and adaptive aggregation should be introduced with explicit equations in §3 to improve readability.
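For concreteness, one plausible instantiation of the requested equations, assuming an InfoNCE-style objective over paired embeddings; the paper's actual definitions may differ:

```latex
% Assumed contrastive planning loss over a batch of B vision-action
% embeddings v_i and recovered instruction embeddings \ell_i,
% with similarity sim(\cdot,\cdot) and temperature \tau:
\mathcal{L}_{\mathrm{cp}}
  = -\frac{1}{B}\sum_{i=1}^{B}
    \log\frac{\exp\!\big(\mathrm{sim}(v_i,\ell_i)/\tau\big)}
             {\sum_{j=1}^{B}\exp\!\big(\mathrm{sim}(v_i,\ell_j)/\tau\big)}

% Assumed form of server-side adaptive aggregation with round-t
% client coefficients \alpha_k^{t}:
w^{t+1} = \sum_{k} \alpha_k^{t}\, w_k^{t+1},
\qquad \sum_{k} \alpha_k^{t} = 1,\quad \alpha_k^{t} \ge 0
```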
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will update the manuscript accordingly.
point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (method description): The training procedure, data requirements, and initialization for the embodied instruction classifier are unspecified. This is load-bearing for the central claim of learning 'without Language Annotations,' because any reliance on language-annotated data (even for pre-training or a seed set) would collapse the premise. The manuscript must clarify whether the classifier is obtained without annotations and how it handles client-specific visual/action shifts without retraining or additional labels.
Authors: We agree that the description of the embodied instruction classifier in §3 is too brief and will expand it substantially in the revision. The revised text will specify that the classifier is initialized from a publicly available pre-trained vision-language model and trained via cross-entropy loss on a fixed, publicly released set of vision-action-instruction triplets drawn from standard benchmarks. No client-specific language annotations are used at any stage. Client-specific visual and action distribution shifts are handled by keeping the classifier weights frozen after initialization; adaptation occurs exclusively through the client-side contrastive planning loss, which aligns the fixed language embeddings with local vision-action features without requiring retraining or new labels. revision: yes
Referee: [§4 (Experiments) and Table 1/2] §4 (Experiments) and Table 1/2: The abstract asserts 'significant outperformance' and 'ablation studies further validate the contribution of each component,' yet the provided summary supplies no quantitative metrics, error bars, dataset sizes, or specific ablation numbers. Without these, the strength of the empirical support for the federated VLA claims cannot be assessed; the full manuscript must include them with clear baselines and statistical significance tests.
Authors: The full manuscript already reports concrete metrics, ablation numbers, and baseline comparisons in Tables 1 and 2 together with dataset sizes. To address the concern directly, the revision will add error bars computed over multiple random seeds, explicitly list the number of runs, and include statistical significance tests (paired t-tests with p-values) for the primary performance gains over baselines. revision: yes
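The promised statistics are straightforward to specify. A minimal sketch of error bars (mean ± standard error across seeds) and a paired t-test, using only the standard library; the numbers are illustrative placeholders, not results from the paper:

```python
# Mean +/- standard error across seeds, and a paired t statistic
# between method and baseline. Compare t against a t table with
# n - 1 degrees of freedom for a p-value.
import math
import statistics

def mean_and_stderr(xs):
    """Sample mean and standard error (sample stdev / sqrt(n))."""
    return statistics.mean(xs), statistics.stdev(xs) / math.sqrt(len(xs))

def paired_t(xs, ys):
    """Paired-samples t statistic on per-seed differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical success rates over 5 seeds:
forgevla = [0.72, 0.75, 0.71, 0.74, 0.73]
baseline = [0.64, 0.66, 0.63, 0.67, 0.65]
m, se = mean_and_stderr(forgevla)
t = paired_t(forgevla, baseline)
```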
Circularity Check
No circularity detected; framework is engineering-based with independent empirical validation.
full rationale
The paper describes ForgeVLA as a federated framework that equips clients with an embodied instruction classifier to map vision-action pairs onto a predefined instruction set, thereby forming VLA triplets without centralizing data or manual annotations. It further proposes a client-side contrastive planning loss and server-side adaptive aggregation to address vision-language feature collapse. No equations, derivations, or first-principles predictions are presented that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on the design of these components and their experimental performance across benchmarks, which constitutes independent content rather than tautological equivalence to the inputs. The classifier's training procedure is not specified in the abstract, but this is an unelaborated assumption, not a circular reduction in any claimed derivation.
Axiom & Free-Parameter Ledger
invented entities (1)
- embodied instruction classifier: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · linked claim: client-side contrastive planning loss with a server-side adaptive aggregation strategy