pith. sign in

arxiv: 2605.26820 · v1 · pith:36UCJQUXnew · submitted 2026-05-26 · 💻 cs.RO

Can VLA Models Learn from Real-World Data Continually without Forgetting?

Pith reviewed 2026-06-29 16:46 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action modelscontinual learningcatastrophic forgettingreal-world roboticsexperience replaymanipulation tasksrobot policies
0
0 comments X

The pith

Vision-language-action models suffer significant catastrophic forgetting when continually trained on heterogeneous real-world robot demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a dataset of four sequential real-world manipulation tasks to examine whether VLA models can acquire new skills without losing previously learned ones. Experiments show these models undergo major forgetting when trained sequentially on heterogeneous physical demonstrations. The authors then evaluate experience replay and identify key factors that control its effectiveness. A reader would care because robots intended for long-term use in unstructured environments must retain earlier behaviors while incorporating new ones.

Core claim

Using a new real-world dataset spanning rigid-object pick-and-place, contact-rich pressing, and deformable-object folding, the study shows that VLA models suffer significant catastrophic forgetting when continually learning from heterogeneous demonstrations; experience replay mitigates this only when specific implementation factors are addressed correctly.

What carries the argument

The real-world continual learning dataset of four sequential manipulation tasks, used as the testbed to quantify forgetting in VLA models and to assess experience replay.

Load-bearing premise

The four sequential manipulation tasks and the collected demonstrations are representative of the challenges that arise in broader real-world continual deployment of VLA models.

What would settle it

An experiment in which a VLA model trained sequentially on the four tasks shows no measurable performance decline on the first task after completing the later tasks would falsify the claim of significant forgetting.

Figures

Figures reproduced from arXiv: 2605.26820 by Jiarun Zhu, Jiayu Chen, Mingqi Yuan, Wenjun Zeng, Xiaoquan Sun, Yijun Hong, Zetian Xu, Zhiyong Wang.

Figure 1
Figure 1. Figure 1: Overview of our investigation into real-world continual VLA learning. We collect a real-world sequential manipulation dataset of four heterogeneous tasks and study whether VLA models can adapt to them sequentially without forgetting. The top panels illustrate the continual learning problem, the real-world task stream, and the training procedure. The bottom panels summa￾rize our central findings: (1) naive … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the (a) robot platform (the left figure) and (b) task sequence (the right figure). We adopt a multi-view teleoperation platform to collect demonstrations across four diverse manipulation tasks. The four tasks span rigid-object pick-and-place, cup hanging, contact-rich pressing, and deformable-object folding. This design provides a deployment-realistic task stream for evaluating whether VLA mode… view at source ↗
Figure 3
Figure 3. Figure 3: Forgetting matrices under sequential fine-tuning. Without ER (left), all previously learned tasks collapse to near-zero performance, confirming severe catastrophic forgetting. With appropriately configured ER (right panels), forgetting is largely eliminated across all tasks. However, excessively high replay frequency impairs new-task learning, while insufficient replay data weakens retention—revealing a U-… view at source ↗
Figure 4
Figure 4. Figure 4: Replay effectiveness exhibits a U-shaped sensitivity to buffer size and replay fre￾quency. Overly frequent replay (fr = 0.5) impairs new-task acquisition, particularly for fragile tasks such as the press button task. Insufficient replay frequency (fr = 0.05) or buffer capacity (B = 0.002) weakens retention on tasks that require diverse replay trajectories, such as the hang cup task. The optimal point (B = … view at source ↗
read the original abstract

Vision-language-action (VLA) models provide a promising foundation for general-purpose robotics. However, their successful deployment in real-world scenarios requires the ability to continually acquire new skills while retaining previously learned behaviors. While pioneering research has studied the continual learning of VLA models in narrowly simulated environments, this challenge remains largely unexplored under realistic conditions. To address this limitation, we construct a real-world continual learning dataset comprising four sequential manipulation tasks, spanning rigid-object pick-and-place, contact-rich pressing, and deformable-object folding. Using this dataset, we conduct comprehensive experiments and find that VLA models suffer significant catastrophic forgetting when continually learning from heterogeneous real-world demonstrations. We then systematically evaluate experience replay and uncover key implementation factors that govern its success. In summary, this work provides the first empirical study of real-world continual VLA learning and offers practical guidance for deploying long-lived robot policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates continual learning for Vision-Language-Action (VLA) models in real-world settings. It introduces a dataset of four sequential manipulation tasks (rigid-object pick-and-place, contact-rich pressing, deformable-object folding) and reports that VLA models exhibit significant catastrophic forgetting when trained sequentially on heterogeneous real-world demonstrations. The work then evaluates experience replay, identifies key implementation factors for its effectiveness, and positions the study as the first empirical examination of real-world continual VLA learning with practical guidance for long-lived policies.

Significance. If the empirical findings hold under broader conditions, the paper would be significant for highlighting a key obstacle to long-term deployment of VLA models outside simulation and for supplying concrete replay-based mitigation insights. It addresses a clear gap between existing simulated continual-learning studies and realistic robotics data.

major comments (2)
  1. [Dataset construction and experimental setup] The headline claim of significant catastrophic forgetting on heterogeneous real-world data rests on a benchmark of exactly four tasks. The manuscript should explicitly justify why this sequence adequately proxies the heterogeneity, long-horizon dependencies, sensor noise accumulation, and task-interference patterns expected in open-ended continual deployment; otherwise the measured forgetting may be an artifact of the narrow task set rather than a general property of VLA fine-tuning.
  2. [Results] The abstract asserts the forgetting result and replay findings yet supplies no quantitative metrics, controls, statistical details, or dataset size. The results section must include these (e.g., per-task success rates before/after sequential training, replay buffer sizes, statistical significance) for the central empirical claim to be evaluable.
minor comments (1)
  1. [Dataset] Clarify the precise number of demonstrations collected per task and any balancing procedures used across the four tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our study of continual learning for VLA models in real-world settings. We address each major comment below.

read point-by-point responses
  1. Referee: [Dataset construction and experimental setup] The headline claim of significant catastrophic forgetting on heterogeneous real-world data rests on a benchmark of exactly four tasks. The manuscript should explicitly justify why this sequence adequately proxies the heterogeneity, long-horizon dependencies, sensor noise accumulation, and task-interference patterns expected in open-ended continual deployment; otherwise the measured forgetting may be an artifact of the narrow task set rather than a general property of VLA fine-tuning.

    Authors: We agree that four tasks constitute a limited benchmark and cannot fully represent all aspects of open-ended deployment. The sequence was deliberately constructed to span distinct heterogeneity dimensions (rigid vs. deformable objects; pick-and-place vs. contact-rich pressing vs. folding) that induce measurable task interference. In revision we will insert an explicit justification paragraph in the dataset section, framing the benchmark as a minimal yet representative real-world proxy while noting its scope limitations and the value of future larger-scale studies. revision: yes

  2. Referee: [Results] The abstract asserts the forgetting result and replay findings yet supplies no quantitative metrics, controls, statistical details, or dataset size. The results section must include these (e.g., per-task success rates before/after sequential training, replay buffer sizes, statistical significance) for the central empirical claim to be evaluable.

    Authors: Abstracts are intentionally concise and omit detailed metrics by design. We acknowledge that the results section must supply all requested quantitative elements for evaluability. The manuscript already reports per-task success rates and replay buffer sizes; we will expand the section with explicit before/after tables, dataset sizes, additional controls, and statistical significance tests in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential predictions

full rationale

The paper constructs a real-world dataset of four sequential manipulation tasks and reports experimental observations of catastrophic forgetting in VLA models along with evaluations of experience replay. No equations, fitted parameters presented as predictions, uniqueness theorems, or self-citation chains appear in the abstract or described methodology. All claims rest on direct empirical measurement rather than any derivation that reduces to its own inputs by construction, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; central claim rests on the representativeness of the four-task dataset and the validity of the experimental protocol described at high level in the abstract. No free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5699 in / 969 out tokens · 17178 ms · 2026-06-29T16:46:08.383741+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

    cs.LG 2026-06 unverdicted novelty 4.0

    LargeMonitor introduces a decoupled framework using large pretrained models for robust drift detection and semantic diagnosis to improve online task-free continual learning.

Reference graph

Works this paper leans on

44 extracted references · 11 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    A Survey on Vision-Language-Action Models for Embodied AI

    Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision- language-action models for embodied ai.arXiv preprint arXiv:2405.14093, 2024

  2. [2]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025

  3. [3]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

  4. [4]

    A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021

    Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021

  5. [5]

    A comprehensive survey of continual learning: Theory, method and application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

  6. [6]

    Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

    Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, and Yuke Zhu. Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

  7. [7]

    Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

    Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, and Roberto Martin- Martin. Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026

  8. [8]

    CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

    Ralf Römer, Yi Zhang, and Angela P Schoellig. Clare: Continual learning for vision-language- action models via autonomous adapter routing and expansion.arXiv preprint arXiv:2601.09512, 2026

  9. [9]

    Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

    Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, and Dongbin Zhao. Towards long-lived robots: Continual learning vla models via reinforcement fine-tuning.arXiv preprint arXiv:2602.10503, 2026

  10. [10]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  11. [11]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024

  12. [12]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  13. [13]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, et al.π0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

  14. [14]

    Rdt-1b: a diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations, 2024. 9

  15. [15]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jiandong Zheng, Junfeng Li, Ziyang Wang, Dawei Liu, Xiang Kang, Yejin Feng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  16. [16]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  17. [17]

    Rethinking the practicality of vision-language-action model: A comprehensive benchmark and an improved baseline.arXiv preprint arXiv:2602.22663, 2026

    Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao, Tongxin Wang, Pengxu Hou, Zhide Zhong, Haodong Yan, Donglin Wang, Jun Ma, and Haoang Li. Rethinking the practicality of vision-language-action model: A comprehensive benchmark and an improved baseline.arXiv preprint arXiv:2602.22663, 2026

  18. [18]

    Atomvla: Scalable post-training for robotic manipulation via predictive latent world models.arXiv preprint arXiv:2603.08519, 2026

    Xiaoquan Sun, Zetian Xu, Chen Cao, Zonghe Liu, Yihan Sun, Jingrui Pang, Ruijian Zhang, Zhen Yang, Kang Pang, Dingxin He, et al. Atomvla: Scalable post-training for robotic manipulation via predictive latent world models.arXiv preprint arXiv:2603.08519, 2026

  19. [19]

    Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

    Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey.ACM Computing Surveys, 58(5):1–42, 2025

  20. [20]

    Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Expe- rience replay for continual learning.Advances in neural information processing systems, 32, 2019

  21. [21]

    Loss of plasticity in deep continual learning.Nature, 632(8026):768–774, 2024

    Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632(8026):768–774, 2024

  22. [22]

    Theory on forgetting and generalization of continual learning

    Sen Lin, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on forgetting and generalization of continual learning. InInternational Conference on Machine Learning, pages 21078–21100. PMLR, 2023

  23. [23]

    icarl: Incremental classifier and representation learning

    Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017

  24. [24]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017

  25. [25]

    Efficient lifelong learning with a-GEM

    Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. InInternational Conference on Learning Representations, 2019

  26. [26]

    Online continual learning with maximal interfered retrieval

    Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. Advances in neural information processing systems, 32, 2019

  27. [27]

    Dark experience for general continual learning: a strong, simple baseline.Advances in neural information processing systems, 33:15920–15930, 2020

    Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline.Advances in neural information processing systems, 33:15920–15930, 2020

  28. [28]

    Partially observable markov decision processes

    Matthijs TJ Spaan. Partially observable markov decision processes. InReinforcement learning: State-of-the-art, pages 387–414. Springer, 2012. 10 A Detailed Task Scoring Rubrics Unlike binary success/failure metrics commonly used in simulation benchmarks, we adopt a fine- grained multi-stage evaluation protocol for all tasks. This design serves two purpose...

  29. [29]

    The gripper approaches the yellow bowl within 3 cm

  30. [30]

    The robot successfully grasps the yellow bowl

  31. [31]

    The robot moves the bowl above the green bowl

  32. [32]

    A penalty of−0.5is applied if the target bowl is knocked over during execution

    The robot successfully places the yellow bowl into the green bowl. A penalty of−0.5is applied if the target bowl is knocked over during execution. A.2 Hang Cup (D2) The robot must grasp a purple cup and hang it onto a mug rack. The task is evaluated using four intermediate checkpoints (maximum score: 4):

  33. [33]

    The robot correctly locates and grasps the cup

  34. [34]

    The robot moves the cup near the mug rack (within 3 cm)

  35. [35]

    The robot performs a valid hanging attempt with inward insertion motion

  36. [36]

    A.3 Press Button (D3) The robot must identify a green button, move the end-effector toward it, and press it successfully

    The cup is successfully hung onto the rack. A.3 Press Button (D3) The robot must identify a green button, move the end-effector toward it, and press it successfully. The task is evaluated using three intermediate checkpoints (maximum score: 3):

  37. [37]

    The robot correctly localizes the button with proper end-effector orientation

  38. [38]

    The end-effector moves near the button (within 3 cm)

  39. [39]

    A.4 Fold Towel (D4) The robot must fold a gray towel corner-to-corner

    The robot successfully presses the button. A.4 Fold Towel (D4) The robot must fold a gray towel corner-to-corner. The task is evaluated using five intermediate checkpoints (maximum score: 5):

  40. [40]

    The robot correctly locates the top-right towel corner

  41. [41]

    The robot successfully grasps the corner

  42. [42]

    The robot folds the corner toward the top-left corner

  43. [43]

    The robot correctly locates the bottom-right corner

  44. [44]

    The robot completes the final towel alignment and tidying behavior. 11