Semi-Supervised Vision-Language-Action Model

Hongyang He; Jiuming Liu; Victor Sanchez

arxiv: 2606.21493 · v1 · pith:GILP2EPZnew · submitted 2026-06-19 · 💻 cs.CV · cs.ET

Pith reviewed 2026-06-26 14:14 UTC · model grok-4.3

classification 💻 cs.CV cs.ET

keywords semi-supervised learning · vision-language-action models · pseudo-action filtering · robot manipulation · parameter-efficient fine-tuning · LIBERO benchmark · self-distillation · reliability controller

The pith

A teacher-student framework filters reliable pseudo-actions to let vision-language-action models adapt with only 10 percent labeled trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SemiVLA as a way to adapt vision-language-action models when most robot trajectories lack action labels. It uses a self-distilled teacher-student setup in which a reliability controller assesses pseudo-actions on unlabeled trajectories for vision-language alignment, action feasibility, and temporal transition consistency. A Bottleneck-Projected Alignment Update keeps noisy feedback from contaminating the teacher. On the LIBERO benchmark with OpenVLA and Selective LoRA, the approach reaches 89.0 percent average success using 10 percent labeled trajectories, an 8.0-point gain over supervised LoRA at no added inference cost. The abstract also reports consistent gains across multiple PEFT strategies and on the CALVIN benchmark.

Core claim

SemiVLA is a self-distilled teacher-student framework that learns from reliable pseudo-actions on unlabeled trajectories. It introduces a VLA-specific reliability controller to assess vision-language alignment, action feasibility, and temporal transition consistency, and updates the teacher through a Bottleneck-Projected Alignment Update to avoid noisy feedback contamination. With OpenVLA as the backbone, SemiVLA consistently improves multiple PEFT strategies across LIBERO and CALVIN. Under 10 percent labeled trajectories, SemiVLA with Selective LoRA achieves 89.0 percent average success on LIBERO, outperforming supervised LoRA by 8.0 points without extra inference cost.

What carries the argument

VLA-specific reliability controller that scores unlabeled trajectories on vision-language alignment, action feasibility, and temporal transition consistency before accepting pseudo-actions for student training.
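The filtering step described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say how the three scores are computed, how they are combined, or what threshold is used. The sketch assumes each score lies in [0, 1], combines them by taking the minimum, and uses a hypothetical threshold of 0.7; `PseudoAction`, `reliability`, and `filter_pseudo_actions` are invented names, not the paper's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PseudoAction:
    """One teacher-predicted action for an unlabeled step (hypothetical container)."""
    step_id: int
    alignment: float     # vision-language alignment score, assumed in [0, 1]
    feasibility: float   # action feasibility score, assumed in [0, 1]
    consistency: float   # temporal transition consistency score, assumed in [0, 1]


def reliability(p: PseudoAction) -> float:
    # Assumption: use the weakest of the three scores, so a pseudo-action
    # must pass every check. The paper may combine the scores differently.
    return min(p.alignment, p.feasibility, p.consistency)


def filter_pseudo_actions(cands: List[PseudoAction],
                          threshold: float = 0.7) -> List[PseudoAction]:
    # Keep only candidates whose combined score reaches the hypothetical threshold.
    return [p for p in cands if reliability(p) >= threshold]


# Made-up scores for three unlabeled steps.
candidates = [
    PseudoAction(0, 0.90, 0.80, 0.95),
    PseudoAction(1, 0.90, 0.30, 0.90),  # low feasibility: rejected
    PseudoAction(2, 0.75, 0.70, 0.80),
]
kept = filter_pseudo_actions(candidates)
print([p.step_id for p in kept])
```

Taking the minimum makes the filter conservative: one failed check is enough to reject a pseudo-action. A weighted sum would be a softer alternative, and which of these (if either) matches SemiVLA is not stated in the abstract.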

If this is right

The same reliability controller and update rule improve several different parameter-efficient fine-tuning methods.
Performance gains appear on both the LIBERO and CALVIN robot manipulation benchmarks.
No extra inference-time cost is incurred once training finishes.
The framework works when only 10 percent of trajectories carry action labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The controller's three scoring axes could be reused as a general filter for other embodied pseudo-labeling tasks where physical feasibility matters.
If the Bottleneck-Projected Alignment Update proves stable, it might replace more complex teacher-update heuristics in other self-distillation settings.
Lowering the labeled-trajectory requirement could make it practical to collect large vision-language datasets from passive robot observation without action recording.

Load-bearing premise

The reliability controller can correctly identify which pseudo-actions are trustworthy enough to improve the student without introducing harmful noise.

What would settle it

Running the full pipeline after disabling or randomizing the reliability controller's three scoring criteria and observing whether the performance gain over supervised LoRA disappears or reverses on LIBERO.
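One way to organize the variants of that experiment is sketched below, under the same illustrative assumptions as before: three scores in [0, 1] combined by their minimum. The mode names (`full`, `random`, `no_alignment`, and so on) and the `make_scorer` helper are hypothetical; the paper's actual controller interface is not described in the abstract.

```python
import random
from typing import Callable, Dict, Tuple

Scores = Tuple[float, float, float]  # (alignment, feasibility, consistency), assumed in [0, 1]
NAMES = ("alignment", "feasibility", "consistency")
MODES = ["full", "random"] + [f"no_{n}" for n in NAMES]


def make_scorer(mode: str, seed: int = 0) -> Callable[[Scores], float]:
    """Return a combined-score function for one ablation variant (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)

    def scorer(scores: Scores) -> float:
        if mode == "random":
            # Randomized controller: ignore the real scores entirely.
            return rng.random()
        # Drop the ablated criterion; "full" keeps all three.
        active = [s for n, s in zip(NAMES, scores) if mode != f"no_{n}"]
        return min(active)

    return scorer


variants: Dict[str, Callable[[Scores], float]] = {m: make_scorer(m) for m in MODES}

example = (0.9, 0.3, 0.8)  # made-up scores for one pseudo-action
for name, fn in variants.items():
    print(name, round(fn(example), 3))
```

Each variant would then drive the same training pipeline, so any difference in LIBERO success rate can be attributed to the controller rather than to the rest of the setup.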

Original abstract

Vision-Language-Action (VLA) models enable robots to predict actions directly from visual observations and language instructions, but adapting them to new environments still depends on costly action-labeled demonstrations. To reduce this dependence, we study semi-supervised VLA adaptation under limited supervision signals, where only a small portion of trajectories contain robot actions and the remaining trajectories provide action-unlabeled vision-language observations. Unlike standard semi-supervised learning, the missing supervision is an embodied action signal that must be visually grounded, language-consistent, physically feasible, and temporally stable. To address this problem, we propose SemiVLA, a self-distilled teacher-student framework that learns from reliable pseudo-actions on unlabeled trajectories. SemiVLA introduces a VLA-specific reliability controller to assess vision-language alignment, action feasibility, and temporal transition consistency, and further updates the teacher through a Bottleneck-Projected Alignment Update to avoid noisy feedback contamination. With OpenVLA as the backbone, SemiVLA consistently improves multiple PEFT strategies across LIBERO and CALVIN. Under 10% labeled trajectories, SemiVLA with Selective LoRA achieves 89.0% average success on LIBERO, outperforming supervised LoRA by 8.0 points without extra inference cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SemiVLA claims an 8-point LIBERO gain at 10% labels via a new reliability controller, but the evidence for that controller is thin.

The main takeaway is that SemiVLA reports an 8-point average success lift on LIBERO under 10% labeled trajectories by using a reliability controller to generate pseudo-actions for self-distillation on OpenVLA, beating plain supervised LoRA without added inference cost.

What is new is the VLA-specific controller that scores vision-language alignment, action feasibility, and temporal transition consistency on unlabeled trajectories, paired with the Bottleneck-Projected Alignment Update to limit teacher contamination. The paper applies this across multiple PEFT strategies and shows gains on both LIBERO and CALVIN.

The framing addresses a practical robotics constraint where action labels are costly, and the method stays within standard self-distillation while tailoring the checks to embodied signals.

The soft spots are the missing ablations on the three controller checks, no variance or seed details, and no direct validation that the filtered pseudo-actions match ground-truth quality. The data split procedure for the 10% labeled set is also unclear. The stress-test concern holds: if the controller scores correlate weakly with actual correctness in novel scenes, the gains could be fragile or illusory.

This is for researchers adapting VLA models under label constraints. It deserves peer review because the problem is real and the proposal is concrete enough to test and improve.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SemiVLA, a self-distilled teacher-student framework for semi-supervised adaptation of Vision-Language-Action (VLA) models under limited action-labeled trajectories. A VLA-specific reliability controller scores unlabeled trajectories on vision-language alignment, action feasibility, and temporal transition consistency to generate pseudo-actions; the teacher is then updated via Bottleneck-Projected Alignment Update. Using OpenVLA as backbone and Selective LoRA, the method reports 89.0% average success on LIBERO with 10% labeled data, an 8-point gain over supervised LoRA, with similar gains on CALVIN and across PEFT strategies, all at no extra inference cost.

Significance. If the gains are reproducible and the controller's filtering is shown to be reliable, the work would provide a practical route to reduce expensive action labeling in VLA adaptation while preserving inference efficiency. The domain-specific reliability criteria and the alignment-update mechanism are concrete contributions that could transfer to other embodied semi-supervised settings.

major comments (2)

[Abstract] Abstract: the central 89.0% success / 8-point gain claim rests on the reliability controller producing usable pseudo-actions, yet the abstract (and available text) provides no quantitative validation of controller precision, no ablation removing individual scoring terms (vision-language alignment, feasibility, temporal consistency), and no variance or run statistics for the reported numbers.
[Abstract] Abstract: the 10% labeled-trajectory split procedure is unspecified, so it is impossible to assess whether the reported improvement is robust to different partitions or whether the controller's filtering benefit is confounded by how the labeled subset was chosen.
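The controller-precision check requested in the first major comment could look roughly like the sketch below. It assumes a validation split where ground-truth actions are available, continuous action vectors, and a hypothetical definition of agreement (every dimension within a tolerance `tol`). None of these choices come from the paper, and the example values are made up.

```python
from typing import List, Sequence


def agrees(pred: Sequence[float], truth: Sequence[float], tol: float = 0.05) -> bool:
    # Assumption: "agreement" means every action dimension is within tol.
    return all(abs(p - t) <= tol for p, t in zip(pred, truth))


def controller_precision(preds: List[Sequence[float]],
                         truths: List[Sequence[float]],
                         accepted: List[bool],
                         tol: float = 0.05) -> float:
    """Fraction of controller-accepted pseudo-actions that agree with ground truth."""
    hits = [agrees(p, t, tol) for p, t, a in zip(preds, truths, accepted) if a]
    if not hits:
        raise ValueError("controller accepted no pseudo-actions")
    return sum(hits) / len(hits)


# Made-up 2-D actions for three validation steps.
preds = [[0.10, 0.20], [0.50, 0.50], [0.00, 0.90]]
truths = [[0.12, 0.21], [0.10, 0.50], [0.01, 0.88]]
accepted = [True, True, False]  # hypothetical controller decisions

precision = controller_precision(preds, truths, accepted)
print(precision)
```

Comparing this number with the agreement rate over all pseudo-actions, accepted or not, would show whether the controller selects better-than-average predictions.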

minor comments (2)

[Abstract] Abstract: the statement that improvements occur 'across multiple PEFT strategies' is not accompanied by per-strategy numbers or tables.
[Abstract] Abstract: error bars, number of seeds, or statistical tests are absent from the numeric claims.
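The run statistics these comments ask for are simple to report. The sketch below shows one conventional form, mean and sample standard deviation over seeds. The per-seed values are placeholders invented for illustration and are not results from the paper.

```python
import statistics

# Placeholder per-seed success rates, invented for illustration only.
# They are NOT results from the paper.
per_seed_success = [0.50, 0.60, 0.70]

mean = statistics.mean(per_seed_success)
std = statistics.stdev(per_seed_success)  # sample standard deviation (n - 1 denominator)

print(f"{100 * mean:.1f} ± {100 * std:.1f} over {len(per_seed_success)} seeds")
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so listing the individual per-seed numbers alongside the summary is a reasonable complement.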

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We address the two major comments below and will incorporate the requested clarifications and additional analyses into the revised manuscript.

Referee: [Abstract] Abstract: the central 89.0% success / 8-point gain claim rests on the reliability controller producing usable pseudo-actions, yet the abstract (and available text) provides no quantitative validation of controller precision, no ablation removing individual scoring terms (vision-language alignment, feasibility, temporal consistency), and no variance or run statistics for the reported numbers.

Authors: We agree that the abstract and main text would be strengthened by explicit quantitative validation of the controller. The current manuscript reports overall task success but does not include a direct precision metric for the pseudo-actions (e.g., agreement rate with held-out ground-truth actions) or per-component ablations. We will add (i) a quantitative controller-precision evaluation on a validation split, (ii) an ablation table removing each scoring term individually, and (iii) mean and standard deviation across three random seeds for all reported numbers. These additions will be placed in both the abstract and Section 4.
revision: yes

Referee: [Abstract] Abstract: the 10% labeled-trajectory split procedure is unspecified, so it is impossible to assess whether the reported improvement is robust to different partitions or whether the controller's filtering benefit is confounded by how the labeled subset was chosen.

Authors: We agree the split procedure must be stated explicitly. The 10% labeled trajectories were obtained by uniform random sampling without replacement from the full trajectory pool per task, with the remainder treated as unlabeled; the same random seed was used for all methods to ensure fair comparison. To demonstrate robustness, we will add (a) an explicit description of the sampling procedure in the methods and abstract, and (b) results across three independent random partitions with variance reported.
revision: yes
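The split procedure in this simulated response (uniform sampling without replacement, per task, with a fixed seed) can be sketched as follows. The function name, the trajectory-ID format, and the rule of keeping at least one labeled trajectory per task are assumptions made for illustration; trajectory IDs are assumed unique within a task.

```python
import random
from typing import Dict, List, Tuple

Split = Dict[str, List[str]]


def split_labeled(trajs_by_task: Split, frac: float = 0.10,
                  seed: int = 0) -> Tuple[Split, Split]:
    """Per-task uniform split into labeled and unlabeled trajectory IDs."""
    rng = random.Random(seed)  # one fixed seed so every method sees the same split
    labeled: Split = {}
    unlabeled: Split = {}
    for task in sorted(trajs_by_task):  # sorted so the result does not depend on dict order
        ids = trajs_by_task[task]
        # Assumption: keep at least one labeled trajectory per task.
        k = max(1, round(frac * len(ids)))
        chosen = set(rng.sample(ids, k))  # sampling without replacement
        labeled[task] = [t for t in ids if t in chosen]
        unlabeled[task] = [t for t in ids if t not in chosen]
    return labeled, unlabeled


# Hypothetical pool: two tasks with 50 trajectories each.
pool = {f"task_{i}": [f"task_{i}/traj_{j:03d}" for j in range(50)] for i in range(2)}
labeled, unlabeled = split_labeled(pool, frac=0.10, seed=0)
print({t: (len(labeled[t]), len(unlabeled[t])) for t in pool})
```

Repeating the call with different `seed` values gives the independent partitions mentioned in point (b).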

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on external comparisons, not self-referential definitions or fits

The paper describes a teacher-student self-distillation framework with a reliability controller for pseudo-action filtering, but reports success rates (e.g., 89.0% on LIBERO with 10% labels) solely via experimental evaluation against supervised baselines on standard benchmarks. No equations, derivations, or 'predictions' appear that reduce the claimed performance to quantities defined inside the paper by construction. The controller is an algorithmic component whose effectiveness is tested empirically rather than assumed via self-definition or self-citation chains. This matches the default case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or implementation details, so free parameters, axioms, and invented entities cannot be enumerated; the reliability controller is presented as a new component whose internal thresholds are unspecified.

