Can VLA Models Learn from Real-World Data Continually without Forgetting?

Jiarun Zhu; Jiayu Chen; Mingqi Yuan; Wenjun Zeng; Xiaoquan Sun; Yijun Hong; Zetian Xu; Zhiyong Wang

arxiv: 2605.26820 · v1 · pith:36UCJQUXnew · submitted 2026-05-26 · 💻 cs.RO

Jiarun Zhu, Yijun Hong, Xiaoquan Sun, Zetian Xu, Mingqi Yuan, Zhiyong Wang, Wenjun Zeng, Jiayu Chen

Pith reviewed 2026-06-29 16:46 UTC · model grok-4.3

classification 💻 cs.RO

keywords: vision-language-action models · continual learning · catastrophic forgetting · real-world robotics · experience replay · manipulation tasks · robot policies

The pith

Vision-language-action models suffer significant catastrophic forgetting when continually trained on heterogeneous real-world robot demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a dataset of four sequential real-world manipulation tasks to examine whether VLA models can acquire new skills without losing previously learned ones. Experiments show these models undergo major forgetting when trained sequentially on heterogeneous physical demonstrations. The authors then evaluate experience replay and identify key factors that control its effectiveness. A reader would care because robots intended for long-term use in unstructured environments must retain earlier behaviors while incorporating new ones.

Core claim

Using a new real-world dataset of four sequential tasks spanning rigid-object pick-and-place, cup hanging, contact-rich pressing, and deformable-object folding, the study shows that VLA models suffer significant catastrophic forgetting when continually learning from heterogeneous demonstrations; experience replay mitigates this only when specific implementation factors are addressed correctly.

What carries the argument

The real-world continual learning dataset of four sequential manipulation tasks, used as the testbed to quantify forgetting in VLA models and to assess experience replay.

Load-bearing premise

The four sequential manipulation tasks and the collected demonstrations are representative of the challenges that arise in broader real-world continual deployment of VLA models.

What would settle it

An experiment in which a VLA model trained sequentially on the four tasks shows no measurable performance decline on the first task after completing the later tasks would falsify the claim of significant forgetting.
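The falsification test above can be made concrete with a forgetting measure that is common in the continual-learning literature: for each earlier task, the best score reached before the final training stage minus the score after the final stage. The sketch below assumes a score matrix `R` where `R[i][j]` is the score on task `j` after training through task `i`; the function name and the numbers are illustrative only and are not taken from the paper, which may define forgetting differently.

```python
def forgetting_per_task(R):
    """Return one forgetting value per earlier task.

    R[i][j] is the score on task j after sequential training through task i.
    For task j, forgetting is the best score seen at any stage from j up to
    (but not including) the final stage, minus the score after the final stage.
    """
    n = len(R)
    result = []
    for j in range(n - 1):
        best_earlier = max(R[i][j] for i in range(j, n - 1))
        result.append(best_earlier - R[n - 1][j])
    return result


# Illustrative (made-up) score matrix for a four-task sequence.
R_example = [
    [0.9, 0.0, 0.0, 0.0],
    [0.1, 0.8, 0.0, 0.0],
    [0.0, 0.1, 0.7, 0.0],
    [0.0, 0.0, 0.1, 0.9],
]
forgetting = forgetting_per_task(R_example)
```

Under this measure, the "no measurable decline" outcome described above would correspond to forgetting values at or near zero for every earlier task.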

Figures

Figures reproduced from arXiv: 2605.26820 by Jiarun Zhu, Jiayu Chen, Mingqi Yuan, Wenjun Zeng, Xiaoquan Sun, Yijun Hong, Zetian Xu, Zhiyong Wang.

**Figure 1.** Overview of our investigation into real-world continual VLA learning. We collect a real-world sequential manipulation dataset of four heterogeneous tasks and study whether VLA models can adapt to them sequentially without forgetting. The top panels illustrate the continual learning problem, the real-world task stream, and the training procedure. The bottom panels summarize our central findings: (1) naive …

**Figure 2.** Overview of the (a) robot platform (the left figure) and (b) task sequence (the right figure). We adopt a multi-view teleoperation platform to collect demonstrations across four diverse manipulation tasks. The four tasks span rigid-object pick-and-place, cup hanging, contact-rich pressing, and deformable-object folding. This design provides a deployment-realistic task stream for evaluating whether VLA mode…

**Figure 3.** Forgetting matrices under sequential fine-tuning. Without ER (left), all previously learned tasks collapse to near-zero performance, confirming severe catastrophic forgetting. With appropriately configured ER (right panels), forgetting is largely eliminated across all tasks. However, excessively high replay frequency impairs new-task learning, while insufficient replay data weakens retention, revealing a U-…

**Figure 4.** Replay effectiveness exhibits a U-shaped sensitivity to buffer size and replay frequency. Overly frequent replay (fr = 0.5) impairs new-task acquisition, particularly for fragile tasks such as the press button task. Insufficient replay frequency (fr = 0.05) or buffer capacity (B = 0.002) weakens retention on tasks that require diverse replay trajectories, such as the hang cup task. The optimal point (B = …
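The two replay knobs named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes, since the captions do not define the symbols, that B is the fraction of each earlier task's samples kept in the buffer and that fr is the per-sample probability of drawing from that buffer instead of the new task. The function names `build_replay_buffer` and `sample_batch` and the toy data are hypothetical, not the authors' implementation.

```python
import random


def build_replay_buffer(old_task_data, buffer_fraction, rng):
    """Keep a random subset (assumed meaning of B) of each earlier task's samples."""
    buffer = []
    for samples in old_task_data:
        keep = max(1, round(buffer_fraction * len(samples)))
        buffer.extend(rng.sample(samples, keep))
    return buffer


def sample_batch(new_task_data, replay_buffer, batch_size, replay_freq, rng):
    """Draw a batch where each slot comes from the buffer with probability
    replay_freq (assumed meaning of fr) and from the new task otherwise."""
    batch = []
    for _ in range(batch_size):
        use_replay = bool(replay_buffer) and rng.random() < replay_freq
        source = replay_buffer if use_replay else new_task_data
        batch.append(rng.choice(source))
    return batch


# Toy usage with placeholder samples labelled by task.
rng = random.Random(0)
old_tasks = [[("task1", i) for i in range(1000)], [("task2", i) for i in range(1000)]]
new_task = [("task3", i) for i in range(1000)]
replay_buffer = build_replay_buffer(old_tasks, buffer_fraction=0.01, rng=rng)
batch = sample_batch(new_task, replay_buffer, batch_size=256, replay_freq=0.2, rng=rng)
```

Under these assumptions, a large `replay_freq` crowds new-task samples out of each batch, while a small `buffer_fraction` leaves few distinct trajectories to replay, which is consistent with the two failure modes the caption describes.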

Original abstract

Vision-language-action (VLA) models provide a promising foundation for general-purpose robotics. However, their successful deployment in real-world scenarios requires the ability to continually acquire new skills while retaining previously learned behaviors. While pioneering research has studied the continual learning of VLA models in narrowly simulated environments, this challenge remains largely unexplored under realistic conditions. To address this limitation, we construct a real-world continual learning dataset comprising four sequential manipulation tasks, spanning rigid-object pick-and-place, contact-rich pressing, and deformable-object folding. Using this dataset, we conduct comprehensive experiments and find that VLA models suffer significant catastrophic forgetting when continually learning from heterogeneous real-world demonstrations. We then systematically evaluate experience replay and uncover key implementation factors that govern its success. In summary, this work provides the first empirical study of real-world continual VLA learning and offers practical guidance for deploying long-lived robot policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

First real-world VLA continual learning dataset shows forgetting but rests on a narrow four-task benchmark.

The paper's main contribution is the first real-world continual learning dataset for VLA models, built from four sequential manipulation tasks, together with a demonstration that these models suffer significant catastrophic forgetting when learning from heterogeneous demonstrations. It also offers guidance on using experience replay effectively.

They do well by shifting the focus from simulated environments to actual robot-collected data, which brings the problem closer to deployment realities. The construction of the dataset covering pick-and-place, pressing, and folding tasks, plus the experiments evaluating replay methods and their key factors, offers practical insights that could help in building more robust long-lived policies.

The soft spot is the limited scope of the benchmark. With only four tasks, it is unclear how well the forgetting results generalize to open-ended real-world continual learning, which involves greater task diversity, longer task sequences, and other kinds of distribution shift. The representativeness concern therefore stands: the chosen tasks may not capture the heterogeneity and interference patterns that would arise in broader use. The paper would be stronger with more tasks or an explicit discussion of these limits.

This kind of work is for researchers in robot learning who are trying to move VLA models out of the lab and into sustained operation. It is worth engaging with because it identifies a real issue and starts to explore mitigations. I would recommend sending it for peer review as the first real-world empirical study on this topic, even though the current scale is modest and revisions would likely be needed to address generalizability.

Referee Report

2 major / 1 minor

Summary. The paper investigates continual learning for Vision-Language-Action (VLA) models in real-world settings. It introduces a dataset of four sequential manipulation tasks (rigid-object pick-and-place, contact-rich pressing, deformable-object folding) and reports that VLA models exhibit significant catastrophic forgetting when trained sequentially on heterogeneous real-world demonstrations. The work then evaluates experience replay, identifies key implementation factors for its effectiveness, and positions the study as the first empirical examination of real-world continual VLA learning with practical guidance for long-lived policies.

Significance. If the empirical findings hold under broader conditions, the paper would be significant for highlighting a key obstacle to long-term deployment of VLA models outside simulation and for supplying concrete replay-based mitigation insights. It addresses a clear gap between existing simulated continual-learning studies and realistic robotics data.

major comments (2)

[Dataset construction and experimental setup] The headline claim of significant catastrophic forgetting on heterogeneous real-world data rests on a benchmark of exactly four tasks. The manuscript should explicitly justify why this sequence adequately proxies the heterogeneity, long-horizon dependencies, sensor noise accumulation, and task-interference patterns expected in open-ended continual deployment; otherwise the measured forgetting may be an artifact of the narrow task set rather than a general property of VLA fine-tuning.
[Results] The abstract asserts the forgetting result and replay findings yet supplies no quantitative metrics, controls, statistical details, or dataset size. The results section must include these (e.g., per-task success rates before/after sequential training, replay buffer sizes, statistical significance) for the central empirical claim to be evaluable.

minor comments (1)

[Dataset] Clarify the precise number of demonstrations collected per task and any balancing procedures used across the four tasks.
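One way to supply the statistical detail requested in the major comments is to report a confidence interval alongside each per-task success rate, since real-robot evaluations usually involve few trials. The sketch below computes a Wilson score interval for a binary success count; the function name and the trial counts are illustrative assumptions, and the paper's multi-stage scores would need a different treatment (for example, a bootstrap over episodes).

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 8 successes in 10 trials before sequential training,
# 1 success in 10 trials after it.
before = wilson_interval(8, 10)
after = wilson_interval(1, 10)
```

If the two intervals do not overlap, the drop is unlikely to be explained by trial-to-trial noise alone; overlapping intervals would call for more evaluation episodes before claiming forgetting.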

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our study of continual learning for VLA models in real-world settings. We address each major comment below.

Referee: [Dataset construction and experimental setup] The headline claim of significant catastrophic forgetting on heterogeneous real-world data rests on a benchmark of exactly four tasks. The manuscript should explicitly justify why this sequence adequately proxies the heterogeneity, long-horizon dependencies, sensor noise accumulation, and task-interference patterns expected in open-ended continual deployment; otherwise the measured forgetting may be an artifact of the narrow task set rather than a general property of VLA fine-tuning.

Authors: We agree that four tasks constitute a limited benchmark and cannot fully represent all aspects of open-ended deployment. The sequence was deliberately constructed to span distinct heterogeneity dimensions (rigid vs. deformable objects; pick-and-place vs. contact-rich pressing vs. folding) that induce measurable task interference. In revision we will insert an explicit justification paragraph in the dataset section, framing the benchmark as a minimal yet representative real-world proxy while noting its scope limitations and the value of future larger-scale studies.

Revision: yes

Referee: [Results] The abstract asserts the forgetting result and replay findings yet supplies no quantitative metrics, controls, statistical details, or dataset size. The results section must include these (e.g., per-task success rates before/after sequential training, replay buffer sizes, statistical significance) for the central empirical claim to be evaluable.

Authors: Abstracts are intentionally concise and omit detailed metrics by design. We acknowledge that the results section must supply all requested quantitative elements for evaluability. The manuscript already reports per-task success rates and replay buffer sizes; we will expand the section with explicit before/after tables, dataset sizes, additional controls, and statistical significance tests in the revised version.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential predictions

The paper constructs a real-world dataset of four sequential manipulation tasks and reports experimental observations of catastrophic forgetting in VLA models along with evaluations of experience replay. No equations, fitted parameters presented as predictions, uniqueness theorems, or self-citation chains appear in the abstract or described methodology. All claims rest on direct empirical measurement rather than any derivation that reduces to its own inputs by construction, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; central claim rests on the representativeness of the four-task dataset and the validity of the experimental protocol described at high level in the abstract. No free parameters, axioms, or invented entities are identifiable from the provided text.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models
cs.LG · 2026-06 · unverdicted · novelty 4.0

LargeMonitor introduces a decoupled framework using large pretrained models for robust drift detection and semantic diagnosis to improve online task-free continual learning.

A Detailed Task Scoring Rubrics

Unlike binary success/failure metrics commonly used in simulation benchmarks, we adopt a fine-grained multi-stage evaluation protocol for all tasks. This design serves two purpose...

A.1 ...

1. The gripper approaches the yellow bowl within 3 cm.
2. The robot successfully grasps the yellow bowl.
3. The robot moves the bowl above the green bowl.
4. The robot successfully places the yellow bowl into the green bowl.

A penalty of −0.5 is applied if the target bowl is knocked over during execution.

A.2 Hang Cup (D2)

The robot must grasp a purple cup and hang it onto a mug rack. The task is evaluated using four intermediate checkpoints (maximum score: 4):

1. The robot correctly locates and grasps the cup.
2. The robot moves the cup near the mug rack (within 3 cm).
3. The robot performs a valid hanging attempt with inward insertion motion.
4. The cup is successfully hung onto the rack.

A.3 Press Button (D3)

The robot must identify a green button, move the end-effector toward it, and press it successfully. The task is evaluated using three intermediate checkpoints (maximum score: 3):

1. The robot correctly localizes the button with proper end-effector orientation.
2. The end-effector moves near the button (within 3 cm).
3. The robot successfully presses the button.

A.4 Fold Towel (D4)

The robot must fold a gray towel corner-to-corner. The task is evaluated using five intermediate checkpoints (maximum score: 5):

1. The robot correctly locates the top-right towel corner.
2. The robot successfully grasps the corner.
3. The robot folds the corner toward the top-left corner.
4. The robot correctly locates the bottom-right corner.
5. The robot completes the final towel alignment and tidying behavior.
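The checkpoint rubrics above can be turned into a per-episode score with a few lines of code. This is a minimal sketch that assumes each checkpoint is worth one point, that a penalty is subtracted from the raw total, and that the result is clamped at zero and normalized by the task's maximum score; the normalization, the clamping, and all names are assumptions for illustration, not details confirmed by the paper. The maximum scores listed are the three that the rubric text states explicitly.

```python
# Maximum scores as stated in the rubric text for D2, D3 and D4.
MAX_SCORE = {"hang_cup": 4, "press_button": 3, "fold_towel": 5}


def stage_score(checkpoints_passed, max_score, penalty=0.0):
    """Normalized multi-stage score in [0, 1].

    checkpoints_passed: number of rubric checkpoints the episode reached.
    max_score: number of checkpoints defined for the task.
    penalty: points subtracted for a rubric violation (e.g. 0.5).
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= checkpoints_passed <= max_score:
        raise ValueError("checkpoints_passed must lie between 0 and max_score")
    raw = checkpoints_passed - penalty
    return max(0.0, raw) / max_score  # clamping at zero is an assumption


# An episode that reaches 2 of the 5 fold-towel checkpoints.
partial = stage_score(2, MAX_SCORE["fold_towel"])

# A hypothetical four-checkpoint task with a 0.5 penalty applied.
penalized = stage_score(3, 4, penalty=0.5)
```

Compared with a binary success flag, a graded score of this kind still separates a policy that reaches the object but fails the final step from one that never approaches it, which is useful when forgetting is partial.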

[1] [1]

A Survey on Vision-Language-Action Models for Embodied AI

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision- language-action models for embodied ai.arXiv preprint arXiv:2405.14093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025

2025

[3] [3]

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

2025

[4] [4]

A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021

2021

[5] [5]

A comprehensive survey of continual learning: Theory, method and application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024

2024

[6] [6]

Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

Huihan Liu, Changyeon Kim, Bo Liu, Minghuan Liu, and Yuke Zhu. Pretrained vision- language-action models are surprisingly resistant to forgetting in continual learning.arXiv preprint arXiv:2603.03818, 2026

work page arXiv 2026

[7] [7]

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, and Roberto Martin- Martin. Simple recipe works: Vision-language-action models are natural continual learners with reinforcement learning.arXiv preprint arXiv:2603.11653, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Ralf Römer, Yi Zhang, and Angela P Schoellig. Clare: Continual learning for vision-language- action models via autonomous adapter routing and expansion.arXiv preprint arXiv:2601.09512, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

Yuan Liu, Haoran Li, Shuai Tian, Yuxing Qin, Yuhui Chen, Yupeng Zheng, Yongzhen Huang, and Dongbin Zhao. Towards long-lived robots: Continual learning vla models via reinforcement fine-tuning.arXiv preprint arXiv:2602.10503, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

2023

[11] [11]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024

2024

[12] [12]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, et al.π0.5: a vision- language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Rdt-1b: a diffusion foundation model for bimanual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations, 2024. 9

2024

[15] [15]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Jiandong Zheng, Junfeng Li, Ziyang Wang, Dawei Liu, Xiang Kang, Yejin Feng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

WorldVLA: Towards Autoregressive Action World Model

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] Wenxuan Song, Jiayi Chen, Xiaoquan Sun, Huashuo Lei, Yikai Qin, Wei Zhao, Pengxiang Ding, Han Zhao, Tongxin Wang, Pengxu Hou, Zhide Zhong, Haodong Yan, Donglin Wang, Jun Ma, and Haoang Li. Rethinking the practicality of vision-language-action model: A comprehensive benchmark and an improved baseline. arXiv preprint arXiv:2602.22663, 2026.

[18] Xiaoquan Sun, Zetian Xu, Chen Cao, Zonghe Liu, Yihan Sun, Jingrui Pang, Ruijian Zhang, Zhen Yang, Kang Pang, Dingxin He, et al. Atomvla: Scalable post-training for robotic manipulation via predictive latent world models. arXiv preprint arXiv:2603.08519, 2026.

[19] Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys, 58(5):1–42, 2025.

[20] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in neural information processing systems, 32, 2019.

[21] Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.

[22] Sen Lin, Peizhong Ju, Yingbin Liang, and Ness Shroff. Theory on forgetting and generalization of continual learning. In International Conference on Machine Learning, pages 21078–21100. PMLR, 2023.

[23] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

[24] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.

[25] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. In International Conference on Learning Representations, 2019.

[26] Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, and Lucas Page-Caccia. Online continual learning with maximal interfered retrieval. Advances in neural information processing systems, 32, 2019.

[27] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33:15920–15930, 2020.

[28] Matthijs TJ Spaan. Partially observable Markov decision processes. In Reinforcement learning: State-of-the-art, pages 387–414. Springer, 2012.

A Detailed Task Scoring Rubrics

Unlike binary success/failure metrics commonly used in simulation benchmarks, we adopt a fine-grained multi-stage evaluation protocol for all tasks. This design serves two purpose...

1. The gripper approaches the yellow bowl within 3 cm.
2. The robot successfully grasps the yellow bowl.
3. The robot moves the bowl above the green bowl.
4. The robot successfully places the yellow bowl into the green bowl.

A penalty of −0.5 is applied if the target bowl is knocked over during execution.

A.2 Hang Cup (D2)

The robot must grasp a purple cup and hang it onto a mug rack. The task is evaluated using four intermediate checkpoints (maximum score: 4):

1. The robot correctly locates and grasps the cup.
2. The robot moves the cup near the mug rack (within 3 cm).
3. The robot performs a valid hanging attempt with inward insertion motion.
4. The cup is successfully hung onto the rack.

A.3 Press Button (D3)

The robot must identify a green button, move the end-effector toward it, and press it successfully. The task is evaluated using three intermediate checkpoints (maximum score: 3):

1. The robot correctly localizes the button with proper end-effector orientation.
2. The end-effector moves near the button (within 3 cm).
3. The robot successfully presses the button.

A.4 Fold Towel (D4)

The robot must fold a gray towel corner-to-corner. The task is evaluated using five intermediate checkpoints (maximum score: 5):

1. The robot correctly locates the top-right towel corner.
2. The robot successfully grasps the corner.
3. The robot folds the corner toward the top-left corner.
4. The robot correctly locates the bottom-right corner.
5. The robot completes the final towel alignment and tidying behavior.
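
The multi-stage rubrics above can be read as a simple additive score: one point per checkpoint reached, minus the knock-over penalty for the bowl task. The following Python sketch illustrates that reading and is not the authors' evaluation code. Several details are assumptions: the label "D1" for the bowl task (inferred from the D2 to D4 naming), its maximum score of 4 (inferred from its four listed checkpoints), one point per checkpoint, and leaving the penalized score unclamped. The names `MAX_SCORE`, `score_episode`, and `normalized_score` are hypothetical.

```python
# Maximum scores for D2-D4 are stated in the rubrics. The D1 entry is an
# assumption: the appendix lists four checkpoints for the bowl task but the
# excerpt does not state its maximum score or its label explicitly.
MAX_SCORE = {"D1": 4, "D2": 4, "D3": 3, "D4": 5}

# Penalty from the rubric: applied when the target bowl is knocked over.
KNOCK_OVER_PENALTY = 0.5


def score_episode(task: str, checkpoints_passed: int, knocked_over: bool = False) -> float:
    """Score one rollout as the number of checkpoints reached (assumed 1 point each).

    The knock-over penalty is only applied to the bowl task (assumed label "D1").
    The result is not clamped at zero; the rubric does not say whether it should be.
    """
    max_score = MAX_SCORE[task]
    if not 0 <= checkpoints_passed <= max_score:
        raise ValueError(f"{task} has {max_score} checkpoints, got {checkpoints_passed}")
    score = float(checkpoints_passed)
    if task == "D1" and knocked_over:
        score -= KNOCK_OVER_PENALTY
    return score


def normalized_score(task: str, score: float) -> float:
    """Divide by the task maximum so tasks with different checkpoint counts are comparable."""
    return score / MAX_SCORE[task]


if __name__ == "__main__":
    # A bowl rollout that completes all stages but knocks over the target bowl.
    print(score_episode("D1", 4, knocked_over=True))
    # A towel rollout that stops after folding the first corner.
    print(normalized_score("D4", score_episode("D4", 3)))
```

Normalizing by the per-task maximum is one way to compare retention across tasks whose rubrics have different numbers of checkpoints; whether the paper reports raw or normalized scores is not stated in this appendix.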