STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations
Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3
The pith
A two-stage decoupled fine-tuning process builds robustness to visual and language noise in VLA models before restoring task alignment on clean data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STRONG-VLA decouples robustness acquisition from task refinement: the model is first exposed to a curriculum of multimodal perturbations of increasing difficulty, enabling progressive robustness learning under controlled distribution shifts, and is then realigned with clean task distributions to recover execution fidelity while preserving the acquired robustness. This yields gains of up to 12.60 percent under seen perturbations and 7.77 percent under unseen perturbations on OpenVLA, with comparable or larger improvements on OpenVLA-OFT and pi0, plus real-world validation on an AIRBOT platform.
What carries the argument
The two-stage decoupled fine-tuning framework that separates curriculum-based robustness acquisition under multimodal perturbations from subsequent clean-data realignment (see the sketch after this card).
Load-bearing premise
Robustness acquired during the perturbation curriculum stage is preserved after realignment on clean data without significant forgetting or interference.
What would settle it
Task success rates measured after Stage II that show no improvement, or a drop, relative to joint-training baselines on the 28-perturbation benchmark would indicate that the separation fails to hold.
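Read as a training schedule, the decoupling amounts to two sequential optimization loops. A minimal sketch, assuming a generic behavior-cloning loss and a perturbation operator with a scalar difficulty knob; `model`, `perturb`, and both loaders are hypothetical stand-ins, since the paper's exact objectives are not reproduced here:

```python
# Hypothetical sketch of the two-stage schedule; not the paper's actual code.
import torch.nn.functional as F

def policy_loss(pred_actions, target_actions):
    # Stand-in behavior-cloning objective for continuous actions.
    return F.mse_loss(pred_actions, target_actions)

def stage1_robustness(model, loader, perturb, optimizer, epochs=10):
    """Stage I: curriculum of multimodal perturbations, easy to hard."""
    for epoch in range(epochs):
        difficulty = (epoch + 1) / epochs  # linear curriculum in [0, 1]
        for obs, instruction, action in loader:
            obs_p, instr_p = perturb(obs, instruction, difficulty)
            loss = policy_loss(model(obs_p, instr_p), action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def stage2_realignment(model, clean_loader, optimizer, epochs=5):
    """Stage II: realign on clean task data only; no perturbations."""
    for _ in range(epochs):
        for obs, instruction, action in clean_loader:
            loss = policy_loss(model(obs, instruction), action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The point of the sketch is the ordering: Stage II never mixes perturbed and clean objectives in the same gradient step, which is where the paper claims joint training conflicts.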
Original abstract
Despite their strong performance in embodied tasks, recent Vision-Language-Action (VLA) models remain highly fragile under multimodal perturbations, where visual corruption and linguistic noise jointly induce distribution shifts that degrade task-level execution. Existing robustness approaches typically rely on joint training with perturbed data, treating robustness as a static objective, which leads to conflicting optimization between robustness and task fidelity. In this work, we propose STRONG-VLA, a decoupled fine-tuning framework that explicitly separates robustness acquisition from task-aligned refinement. In Stage I, the model is exposed to a curriculum of multimodal perturbations with increasing difficulty, enabling progressive robustness learning under controlled distribution shifts. In Stage II, the model is re-aligned with clean task distributions to recover execution fidelity while preserving robustness. We further establish a comprehensive benchmark with 28 perturbation types spanning both textual and visual modalities, grounded in realistic sources of sensor noise, occlusion, and instruction corruption. Extensive experiments on the LIBERO benchmark show that STRONG-VLA consistently improves task success rates across multiple VLA architectures. On OpenVLA, our method achieves gains of up to 12.60% under seen perturbations and 7.77% under unseen perturbations. Notably, similar or larger improvements are observed on OpenVLA-OFT (+14.48% / +13.81%) and pi0 (+16.49% / +5.58%), demonstrating strong cross-architecture generalization. Real-world experiments on an AIRBOT robotic platform further validate its practical effectiveness. These results highlight the importance of decoupled optimization for multimodal robustness and establish STRONG-VLA as a simple yet principled framework for robust embodied control.
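To make the benchmark's perturbation families concrete, here is a minimal sketch of one visual-noise, one occlusion, and one instruction-corruption operator sharing a severity knob. The operators and parameter values are illustrative assumptions, not the paper's actual 28-type suite:

```python
# Illustrative multimodal perturbations; all parameters are hypothetical.
import random
import numpy as np

def gaussian_sensor_noise(image: np.ndarray, severity: float) -> np.ndarray:
    """Additive Gaussian noise on a float image in [0, 1]; severity scales the std."""
    noisy = image + np.random.normal(0.0, 0.1 * severity, image.shape)
    return np.clip(noisy, 0.0, 1.0)

def random_occlusion(image: np.ndarray, severity: float) -> np.ndarray:
    """Zero out a square patch whose side length grows with severity."""
    h, w = image.shape[:2]
    side = int(min(h, w) * 0.4 * severity)
    if side == 0:
        return image
    y, x = random.randrange(h - side), random.randrange(w - side)
    out = image.copy()
    out[y:y + side, x:x + side] = 0.0
    return out

def word_dropout(instruction: str, severity: float) -> str:
    """Drop each word independently with probability proportional to severity."""
    words = instruction.split()
    kept = [w for w in words if random.random() > 0.3 * severity]
    return " ".join(kept) if kept else instruction

# Example at mid severity: corrupt one observation/instruction pair.
img = np.random.rand(224, 224, 3)
img_p = random_occlusion(gaussian_sensor_noise(img, 0.5), 0.5)
instr_p = word_dropout("put the bowl on the plate", 0.5)
```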
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes STRONG-VLA, a decoupled two-stage fine-tuning framework for Vision-Language-Action (VLA) models to improve robustness to multimodal (visual and linguistic) perturbations. Stage I applies a curriculum of increasing-difficulty perturbations to acquire robustness; Stage II realigns the model on clean task data to recover execution fidelity while aiming to preserve the learned robustness. A 28-type perturbation benchmark is introduced, and experiments on the LIBERO benchmark report consistent task-success gains across OpenVLA (+12.60% seen / +7.77% unseen), OpenVLA-OFT, and pi0, with additional real-robot validation on an AIRBOT platform.
Significance. If the central claim holds, the work provides evidence that explicitly separating robustness acquisition from task realignment can reduce optimization conflicts that arise in joint training, yielding more reliable VLA performance under realistic sensor and instruction noise. The cross-architecture consistency, introduction of a grounded multimodal benchmark, and real-world hardware results would strengthen the case for decoupled robustness methods in embodied AI.
Major comments (2)
- [§3 (Method) and §4 (Experiments)] The central claim that robustness acquired in Stage I survives Stage II realignment on clean data is not directly supported by evidence. No before/after robustness metrics on perturbed inputs, no ablation removing Stage I, and no retention mechanisms (e.g., regularization or replay) are reported; the published gains are measured only on the final model. This assumption is load-bearing for the decoupled-training thesis.
- [§4 (Experiments)] The experimental section lacks statistical details (standard deviations, number of runs, significance tests) and explicit data-split descriptions for the seen/unseen perturbation partitions. Without these, the reported gains (e.g., +12.60% / +7.77% on OpenVLA) cannot be fully verified as robust rather than incidental.
Minor comments (2)
- [Abstract and §3] The abstract and method description would benefit from a concise statement of the exact loss functions or objectives used in each stage to clarify how the two objectives are decoupled.
- [§4.1 (Benchmark)] Perturbation generation details (e.g., specific visual corruption parameters and linguistic noise models) should be expanded in the benchmark section to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the manuscript and committing to revisions that strengthen the evidence for our claims without misrepresenting the current results.
Point-by-point responses
Referee: [§3 (Method) and §4 (Experiments)] The central claim that robustness acquired in Stage I survives Stage II realignment on clean data is not directly supported by evidence. No before/after robustness metrics on perturbed inputs, no ablation removing Stage I, and no retention mechanisms (e.g., regularization or replay) are reported; the published gains are measured only on the final model. This assumption is load-bearing for the decoupled-training thesis.
Authors: We acknowledge that the manuscript primarily reports final-model performance on perturbed inputs rather than explicit before/after Stage II comparisons. The observed gains on both seen and unseen perturbations (including cross-architecture results) provide indirect support that robustness is retained, as Stage II uses only clean data yet the model still outperforms baselines under perturbation. However, we agree this is insufficient to fully substantiate the decoupling thesis. In the revised manuscript we will add: (i) direct robustness metrics on perturbed inputs before and after Stage II, (ii) an ablation that removes Stage I entirely (training only on clean data), and (iii) explicit discussion of how the two-stage separation itself functions as the retention mechanism by avoiding the optimization conflicts of joint training. These additions will be placed in §3 and §4. revision: yes
Referee: [§4 (Experiments)] The experimental section lacks statistical details (standard deviations, number of runs, significance tests) and explicit data-split descriptions for the seen/unseen perturbation partitions. Without these, the reported gains (e.g., +12.60% / +7.77% on OpenVLA) cannot be fully verified as robust rather than incidental.
Authors: We agree that the current experimental reporting is incomplete on these dimensions. The manuscript states the gains but does not include variance estimates or split details. In the revision we will expand §4 to report: standard deviations over multiple random seeds (we will run and report at least 3–5 independent trials per setting), results of statistical significance tests (e.g., paired t-tests against baselines), and a precise description of the seen/unseen perturbation partitions, including how the 28-type benchmark was partitioned to guarantee that unseen perturbations are genuinely novel and not merely held-out instances of seen types. revision: yes
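For reference, the variance-and-significance reporting the authors commit to could look like the following sketch. The per-seed success rates below are made-up placeholders to show the computation only, not results from the paper:

```python
# Hypothetical per-seed task success rates (%); placeholder numbers only.
import numpy as np
from scipy import stats

baseline = np.array([61.2, 59.8, 62.5, 60.1, 61.7])    # 5 seeds, joint-training baseline
strong_vla = np.array([73.4, 72.1, 74.9, 72.8, 73.6])  # same 5 seeds, STRONG-VLA

print(f"baseline:   {baseline.mean():.2f} +/- {baseline.std(ddof=1):.2f}")
print(f"STRONG-VLA: {strong_vla.mean():.2f} +/- {strong_vla.std(ddof=1):.2f}")

# Paired t-test, since seeds (and hence evaluation episodes) are matched.
t_stat, p_value = stats.ttest_rel(strong_vla, baseline)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```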
Circularity Check
No significant circularity; purely empirical claims with no derivation chain
Full rationale
The paper presents an empirical two-stage fine-tuning procedure (Stage I curriculum on multimodal perturbations, Stage II clean-data realignment) and reports measured success rates on held-out perturbations and hardware. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All performance numbers (e.g., +12.60% seen / +7.77% unseen on OpenVLA) are obtained from external benchmarks and are therefore falsifiable rather than tautological by construction.
Forward citations
Cited by 1 Pith paper
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation. OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Reference graph
Works this paper leans on
- [1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky
- [2] $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG]. https://arxiv.org/abs/2410.24164
- [3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, ... arXiv, 2023.
- [4] Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Jiahang Cao, Yijie Guo, Ning Liu, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, and Renjing Xu. 2025. Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models. arXiv:2409.13174 [cs.CV]. https://arxiv.org/abs/2409.13174
- [5] Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Weifeng Lv, and Simin Li. 2026. On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations. arXiv:2510.00037 [cs.CV]. https://arxiv.org/abs/2510.00037
- [6] Asher J Hancock, Allen Z Ren, and Anirudha Majumdar. 2025. Run-time observation interventions make vision-language-action models more visually robust. In 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 9499–9506.
- [7] Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv:2502.19645 [cs.RO]. https://arxiv.org/abs/2502.19645
- [8] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. 2024. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246 [cs.RO]. https://arxiv.org/abs/2406.09246
- [9]
- [10] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. arXiv:2306.03310 [cs.AI]. https://arxiv.org/abs/2306.03310
- [11] Hanqing Liu, Shouwei Ruan, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, and Wen Yao. 2026. Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations. arXiv:2509.18953 [cs.RO]. https://arxiv.org/abs/2509.18953
- [12] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2019. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083 [stat.ML]. https://arxiv.org/abs/1706.06083
- [13] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. 2024. Octo: An Open-Source Generalist Robot Policy. arXiv:2405.12213 [cs.RO]. https://arxiv.org/abs/2405.12213
- [14] Taowen Wang, Cheng Han, James Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. 2025. Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 6948–6958.
- [15] Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma. 2025. VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1615–1638. https://api.semanticscholar.org/CorpusID:272753667
- [16]
- [17]
- [18]
- [19] Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, and Donglin Wang
- [20] GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation. arXiv:2502.09268 [cs.RO]. https://arxiv.org/abs/2502.09268
- [21] Hongyin Zhang, Shuo Zhang, Junxi Jin, Qixin Zeng, Runze Li, and Donglin Wang
- [22] RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models. arXiv:2511.01331 [cs.RO]. https://arxiv.org/abs/2511.01331
- [23] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michale...
- [24] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Proceedings of The 7th Conference on Robot Learning (Proceedings of Machine Learning Research, Vol. 229), Jie Tan, Marc Toussaint, and Kourosh Darvish (Eds.). PMLR, 2165–2183. https://proceedings.mlr.press/v229/zitkovich23a.html