Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

Bo Liu; Chen Tang; Jay Shim; Jiaheng Hu; Peter Stone; Roberto Martin-Martin; Yoonchang Sung

arxiv: 2603.11653 · v2 · pith:ROSR2MAJnew · submitted 2026-03-12 · 💻 cs.LG · cs.RO

Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin

keywords continual learning adaptation large fine-tuning forgetting lifelong model

Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across diverse lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.
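
To make the recipe named in the abstract concrete, here is a minimal toy sketch of sequential fine-tuning with a low-rank adapter. It is not the authors' code: the class `LoRALinear`, the function `sequential_finetune`, the squared-error loss (standing in for an on-policy RL objective), and the two tiny "tasks" are all hypothetical choices made for illustration. The only points it tries to show are that the pretrained weight `W` stays frozen, that only the low-rank factors `A` and `B` are updated, and that tasks are visited one after another with no replay or regularization.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRALinear:
    """Toy linear layer y = (W + B A) x with W frozen (hypothetical helper)."""

    def __init__(self, W, rank, seed=0):
        rng = random.Random(seed)
        self.W = [row[:] for row in W]  # frozen "pretrained" weight
        d_out, d_in = len(W), len(W[0])
        # Common LoRA initialisation: A small random, B zero, so B A = 0 at start.
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        h = matvec(self.A, x)
        return [b + d for b, d in zip(matvec(self.W, x), matvec(self.B, h))]

    def loss(self, x, target):
        return 0.5 * sum((y - t) ** 2 for y, t in zip(self.forward(x), target))

    def sgd_step(self, x, target, lr):
        """One gradient step on the squared error; only A and B change."""
        h = matvec(self.A, x)
        err = [y - t for y, t in zip(self.forward(x), target)]
        # Gradient w.r.t. h, computed before B is modified.
        grad_h = [sum(self.B[i][k] * err[i] for i in range(len(err)))
                  for k in range(len(h))]
        for i, e in enumerate(err):
            for k, hk in enumerate(h):
                self.B[i][k] -= lr * e * hk
        for k, g in enumerate(grad_h):
            for j, xj in enumerate(x):
                self.A[k][j] -= lr * g * xj

def sequential_finetune(layer, tasks, steps=300, lr=0.05):
    """Train on each task in order, with no replay of earlier tasks.

    Returns one (loss_before, loss_after) pair per task.
    """
    history = []
    for task in tasks:
        before = sum(layer.loss(x, t) for x, t in task)
        for _ in range(steps):
            for x, t in task:
                layer.sgd_step(x, t, lr)
        after = sum(layer.loss(x, t) for x, t in task)
        history.append((before, after))
    return history

# Two toy "tasks", each a single (input, target) pair (illustrative only).
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tasks = [
    [([1.0, 0.0, 0.0], [1.5, 0.0])],
    [([0.0, 1.0, 0.0], [0.0, 1.5])],
]
layer = LoRALinear(W0, rank=1)
history = sequential_finetune(layer, tasks)
```

The sketch deliberately makes no claim about forgetting: whether earlier tasks are retained in practice is the empirical question the paper studies with large pretrained VLAs, and a three-dimensional toy cannot answer it.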

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can VLA Models Learn from Real-World Data Continually without Forgetting?
cs.RO 2026-05 unverdicted novelty 7.0

VLA models exhibit catastrophic forgetting on a new real-world dataset of four sequential manipulation tasks, with experience replay implementation factors evaluated for mitigation.

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
cs.RO 2026-05 unverdicted novelty 7.0

ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors
cs.LG 2026-06 unverdicted novelty 6.0

Argues for shifting to diagnosis-driven tension management of offline priors in online RL, supported by a framework on prior roles, experiments showing help-or-hurt reversals, and cross-domain evidence.

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents
cs.RO 2026-06 unverdicted novelty 6.0

A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 6.0

PHASER improves average success rate by up to 31% over uniform experience replay on LIBERO continual learning benchmarks for VLA models through phase-centric capacity allocation and semantic interference routing.

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT
cs.RO 2026-05 unverdicted novelty 5.0

ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.