CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
Pith reviewed 2026-05-21 15:48 UTC · model grok-4.3
The pith
CLARE lets pre-trained vision-language-action models learn new robot tasks without forgetting old ones or needing stored examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents CLARE as a general framework for exemplar-free continual learning in vision-language-action models. It inserts lightweight modular adapters into selected VLA modules and expands the model only where necessary when a new task arrives, with the decision guided by layer-wise feature similarity. During operation, an autoencoder-based routing mechanism dynamically selects and activates the most relevant adapters without requiring task labels. Experiments on the LIBERO benchmark and five real-world tasks demonstrate that this yields high performance on new tasks while avoiding catastrophic forgetting of earlier ones, even surpassing methods that store exemplars.
What carries the argument
Layer-wise feature similarity for deciding adapter insertion points combined with autoencoder-based routing for dynamic adapter activation at deployment time.
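The routing half of this mechanism can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes that "most relevant" means "lowest autoencoder reconstruction error", which the summary above does not state explicitly, and it swaps in a toy rank-1 linear autoencoder. The names `make_rank1_autoencoder`, `reconstruction_error`, and `route` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Reconstructor = Callable[[Vector], List[float]]


def make_rank1_autoencoder(direction: Vector) -> Reconstructor:
    """Toy stand-in for a learned autoencoder (assumption, not the paper's model).

    Encodes a feature vector as its projection onto one unit direction
    and decodes by scaling that direction back up.
    """
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]

    def reconstruct(x: Vector) -> List[float]:
        code = sum(xi * ui for xi, ui in zip(x, unit))  # 1-D latent code
        return [code * ui for ui in unit]

    return reconstruct


def reconstruction_error(x: Vector, reconstruct: Reconstructor) -> float:
    """Squared L2 distance between a feature and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))


def route(x: Vector, autoencoders: Dict[str, Reconstructor]) -> str:
    """Pick the adapter whose autoencoder reconstructs x best; no task label needed."""
    return min(autoencoders, key=lambda name: reconstruction_error(x, autoencoders[name]))


# Two hypothetical adapters, each paired with its own autoencoder.
autoencoders = {
    "adapter_a": make_rank1_autoencoder([1.0, 0.0]),
    "adapter_b": make_rank1_autoencoder([0.0, 1.0]),
}
chosen = route([0.9, 0.1], autoencoders)
```

In this toy setup a feature lying mostly along the first axis is sent to `adapter_a`, because that adapter's autoencoder leaves the smaller residual.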
If this is right
- Long sequences of robot tasks can be learned sequentially with maintained performance on all previous tasks.
- Memory usage stays low since no previous task data needs to be stored.
- Deployment requires no task identifiers, allowing fluid operation in unstructured environments.
- Overall success rates exceed those of continual learning approaches that rely on exemplar storage.
Where Pith is reading between the lines
- The similarity-guided expansion and autoencoder routing might transfer to continual learning in other neural networks used for control or perception.
- Testing the method on tasks with greater environmental variation could reveal how robust the similarity signal remains.
- Integrating this with other parameter-efficient techniques could further reduce the overhead of model expansion over very long task lifetimes.
Load-bearing premise
Layer-wise feature similarity serves as a dependable indicator for both when and where to add new adapters so that routing can preserve performance over extended task sequences without labels or stored examples.
What would settle it
A long task sequence in which a new task's features closely resemble those of an unrelated earlier task, so that adapters are placed incorrectly and success rates on the earliest tasks drop measurably.
Original abstract
To teach robots complex manipulation tasks, a common approach is to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected VLA modules and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark and five real-world tasks, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code, data, and videos are available at our website: https://tum-lsy.github.io/clare.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLARE, a parameter-efficient continual learning framework for vision-language-action (VLA) models. It inserts lightweight modular adapters into selected VLA modules and expands the model autonomously when learning a new task, guided by layer-wise feature similarity. An autoencoder-based routing mechanism dynamically activates relevant adapters at deployment without task labels or stored exemplars. Experiments on the LIBERO benchmark and five real-world tasks are presented to demonstrate high performance on new tasks without catastrophic forgetting, with claimed outperformance over exemplar-based methods.
Significance. If the empirical results hold under the reported conditions, the work would be significant for practical long-term robotic deployment, as it targets the limitations of exemplar storage, task identifiers, and short task sequences in existing robotics continual learning. The framework's use of standard adapter and autoencoder components with autonomous decisions is a pragmatic engineering contribution. Credit is given for the inclusion of real-world tasks alongside the LIBERO benchmark and for making code, data, and videos available.
major comments (2)
- §3.2 (Adapter Expansion Mechanism): The central claim of exemplar-free operation rests on layer-wise feature similarity providing a reliable, task-agnostic signal for both when and where to insert new adapters. However, the manuscript does not report controls comparing this trigger to alternatives such as gradient-norm thresholds or autoencoder reconstruction error, nor does it validate that chosen layers correlate with task-specific performance gains rather than shared visual statistics across tasks in the entangled VLA representation space. This leaves the robustness of the routing and expansion decisions unverified for long sequences.
- Table 3 (Ablation on Routing): The reported outperformance over exemplar-based baselines is load-bearing for the practical advantage of CLARE, yet the ablation isolating the autoencoder router's contribution from the expansion strategy is insufficiently detailed; without quantitative metrics on how similarity thresholds and autoencoder training hyperparameters are selected, the support for the no-forgetting claim across task sequences cannot be fully assessed.
minor comments (2)
- Figure 4 (Routing Visualization): The diagram of the autoencoder router would benefit from explicit annotation of the reconstruction loss term and how it interacts with the similarity-based expansion decision to improve clarity for readers.
- §4.3 (Real-world Experiments): The description of the five real-world tasks lacks a table summarizing per-task success rates and forgetting metrics; adding this would strengthen the presentation of the empirical results.
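The per-task table requested in the last comment could be derived from a success-rate matrix. The sketch below uses a common definition of forgetting (best earlier success on a task minus its final success); the paper's own metric may differ. The matrix values are made-up illustrative numbers, not results from the paper, and the function names are hypothetical.

```python
from typing import List

# R[i][j] = success rate on task j after training on tasks 0..i.
# Illustrative numbers only (assumption), not taken from the paper.
R: List[List[float]] = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.95],
]


def final_success(R: List[List[float]]) -> float:
    """Mean success rate over all tasks after the last training stage."""
    return sum(R[-1]) / len(R[-1])


def per_task_forgetting(R: List[List[float]]) -> List[float]:
    """For each task except the last: best earlier success minus final success."""
    last = len(R) - 1
    return [
        max(R[i][j] for i in range(j, last)) - R[last][j]
        for j in range(last)
    ]


def average_forgetting(R: List[List[float]]) -> float:
    """Mean of the per-task forgetting values (0.0 if only one task was learned)."""
    values = per_task_forgetting(R)
    return sum(values) / len(values) if values else 0.0


forgetting = per_task_forgetting(R)
summary = (final_success(R), average_forgetting(R))
```

Under this definition a method with no catastrophic forgetting would show per-task values near zero, which is the property the referee asks to see tabulated.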
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and insightful comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: §3.2 (Adapter Expansion Mechanism): The central claim of exemplar-free operation rests on layer-wise feature similarity providing a reliable, task-agnostic signal for both when and where to insert new adapters. However, the manuscript does not report controls comparing this trigger to alternatives such as gradient-norm thresholds or autoencoder reconstruction error, nor does it validate that chosen layers correlate with task-specific performance gains rather than shared visual statistics across tasks in the entangled VLA representation space. This leaves the robustness of the routing and expansion decisions unverified for long sequences.
  Authors: We thank the referee for highlighting this important aspect of our adapter expansion mechanism. The choice of layer-wise feature similarity was motivated by its ability to capture distributional shifts in the feature space without requiring task labels or stored data, which aligns with our exemplar-free goal. To strengthen this, we have added new ablation studies in the revised manuscript comparing the feature similarity-based trigger against gradient-norm thresholds and autoencoder reconstruction error. These experiments demonstrate that feature similarity yields more consistent expansion decisions across tasks. Additionally, we have included an analysis correlating the selected layers with performance improvements on task-specific metrics, showing that insertions in high-similarity layers lead to better adaptation without affecting shared representations. For long sequences, while our primary experiments use sequences of 5-10 tasks, we have extended the evaluation to longer sequences in the appendix and discussed potential limitations for very extended deployments.
  Revision: yes
- Referee: Table 3 (Ablation on Routing): The reported outperformance over exemplar-based baselines is load-bearing for the practical advantage of CLARE, yet the ablation isolating the autoencoder router's contribution from the expansion strategy is insufficiently detailed; without quantitative metrics on how similarity thresholds and autoencoder training hyperparameters are selected, the support for the no-forgetting claim across task sequences cannot be fully assessed.
  Authors: We agree that more details on the ablation and hyperparameter selection would enhance the clarity of our results. In the revised version, we have expanded Table 3 with additional rows isolating the router's contribution and provided quantitative metrics on the similarity threshold selection process, including sensitivity analysis. We have also detailed the autoencoder training hyperparameters and their selection criteria based on validation reconstruction error. These additions support the robustness of the no-forgetting claim by showing consistent performance across varying task sequences.
  Revision: yes
Circularity Check
No significant circularity; CLARE is an empirical engineering framework
The paper introduces CLARE as a parameter-efficient continual learning method for VLAs that inserts adapters based on layer-wise feature similarity and routes them via an autoencoder during inference. No derivation chain, prediction, or first-principles result is presented that reduces by construction to a fitted parameter or self-defined quantity. The central claims rest on experimental validation across the LIBERO benchmark and real-world tasks rather than on any self-referential equations or load-bearing self-citations. Standard adapter and autoencoder components are adopted without smuggling in prior author-specific ansatzes that would create circularity. This is the expected outcome for an applied robotics engineering paper whose performance claims are externally falsifiable through benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity... autoencoder-based routing mechanism dynamically activates the most relevant adapters"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "if all z-scores exceed a threshold γ... Expand A_ℓ ← A_ℓ ∪ {A^{k_ℓ}_ℓ}"
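The expansion rule quoted in the second passage (add a new adapter to layer ℓ only if every z-score exceeds γ) can be sketched as follows. The excerpt elides what the z-scores are computed over, so this sketch assumes one scalar novelty score per existing adapter, standardized against statistics stored with that adapter. `AdapterStats`, `z_scores`, and `maybe_expand` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdapterStats:
    """Per-adapter reference statistics (assumed scalar score statistics)."""
    name: str
    mean: float
    std: float


def z_scores(scores: List[float], adapters: List[AdapterStats]) -> List[float]:
    """Standardize the new task's score against each existing adapter's statistics."""
    return [(s - a.mean) / a.std for s, a in zip(scores, adapters)]


def maybe_expand(
    adapters: List[AdapterStats],
    scores: List[float],
    gamma: float,
    new_adapter: AdapterStats,
) -> List[AdapterStats]:
    """Return the layer's adapter set, grown by one only if all z-scores exceed gamma."""
    if len(scores) != len(adapters):
        raise ValueError("need exactly one score per existing adapter")
    # all() over an empty list is True, so an empty layer always gets its first adapter.
    if all(z > gamma for z in z_scores(scores, adapters)):
        return adapters + [new_adapter]  # A_l <- A_l U {new adapter}
    return adapters  # an existing adapter is close enough; reuse it


layer = [AdapterStats("adapter_0", mean=1.0, std=0.5)]
candidate = AdapterStats("adapter_1", mean=4.0, std=0.5)
expanded = maybe_expand(layer, scores=[4.0], gamma=2.0, new_adapter=candidate)
reused = maybe_expand(layer, scores=[1.5], gamma=2.0, new_adapter=candidate)
```

Here a score of 4.0 gives a z-score of 6.0 against `adapter_0` and triggers expansion, while a score of 1.5 gives a z-score of 1.0 and leaves the layer unchanged.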
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
- Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
  Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
- Modular Continual Learning via Zero-Leakage Reconstruction Routing and Autonomous Task Discovery
  A silicon-native modular system with parallel live distillation and a tight-bottleneck autoencoder achieves parameter isolation, autonomous task discovery, and strong retention across vision and language tasks without...