LiMoDE: Rethinking Lifelong Robot Manipulation from a Mixture-of-Dynamic-Experts Perspective

Lin Wang; Zhihao Gu

arXiv: 2606.26183 · v1 · pith:KPHQD4FGnew · submitted 2026-06-24 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-06-26 02:02 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords lifelong learning · robot manipulation · mixture of experts · dynamic experts · task adaptation · catastrophic forgetting · parameter-efficient fine-tuning

The pith

A dynamic mixture-of-experts structure learns reusable robot manipulation skills in pre-training and combines them with new experts for lifelong task adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiMoDE as a two-stage method to let robots build on prior manipulation experience when facing new tasks. In the first stage a dynamic mixture of experts activates different numbers of heterogeneous specialists according to motion patterns, extracting reusable short-term skills across multiple tasks. In the second stage new lifelong experts are trained and merged on the fly with the frozen pre-trained ones so that knowledge transfers without overwriting earlier capabilities. This addresses the limits of prompt-based or single-task fine-tuning approaches that either lose old skills or fail to model skill interactions. If the separation of motion-driven experts succeeds, robots can accumulate capabilities over time while adding only a moderate number of extra parameters and little inference cost.

Core claim

LiMoDE first trains a dynamic MoE during multi-task pre-training that activates a varied number of heterogeneous experts based on motion information to capture prior knowledge for different short-term manipulations. It then applies a lifelong MoE adaptation mechanism that learns new experts and dynamically combines them with the frozen pre-trained experts when facing novel tasks, enabling knowledge transfer and lifelong adaptation.
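A minimal sketch of what motion-conditioned dynamic activation could look like, assuming a simple thresholding rule: an expert is active when its router probability clears a threshold, and the threshold drops as motion grows, so that busier motion recruits more experts. The names `select_experts` and `motion_threshold`, the threshold range, and the direction of the motion mapping are illustrative assumptions, not the paper's formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def motion_threshold(motion_magnitude, low=0.05, high=0.30):
    """Hypothetical mapping: larger motion -> lower threshold -> more experts."""
    squashed = 1.0 / (1.0 + motion_magnitude)  # lies in (0, 1]
    return low + (high - low) * squashed

def select_experts(router_logits, threshold):
    """Return indices and renormalized weights of experts above the threshold.

    Falls back to the single most probable expert so at least one is active.
    """
    probs = softmax(router_logits)
    active = [i for i, p in enumerate(probs) if p >= threshold]
    if not active:
        active = [max(range(len(probs)), key=probs.__getitem__)]
    total = sum(probs[i] for i in active)
    weights = [probs[i] / total for i in active]
    return active, weights

# Same router logits, two motion levels: the active set grows with motion.
logits = [2.0, 1.0, 0.5, -1.0]
slow_idx, slow_w = select_experts(logits, motion_threshold(0.0))
fast_idx, fast_w = select_experts(logits, motion_threshold(9.0))
```

The point of the sketch is only that the number of active experts is a per-input quantity rather than a fixed top-k; any rule that ties it to motion would serve the same illustrative purpose.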

What carries the argument

The LiMoDE two-stage scheme that uses motion-based dynamic expert activation in pre-training followed by on-the-fly combination of new lifelong experts with frozen ones during adaptation.
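The adaptation stage can be pictured with a toy layer in which every expert is a single scalar gain. This is a sketch under stated assumptions: the class `LifelongMoE`, its methods, and the squared-error objective are hypothetical stand-ins chosen to show one property, namely that gradient updates touch only the newly added experts while the pre-trained ones still contribute to the output.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class LifelongMoE:
    """Toy layer: each expert is one scalar gain applied to the input."""

    def __init__(self, frozen_gains):
        self.frozen = tuple(frozen_gains)  # pre-trained experts, never updated
        self.lifelong = []                 # experts added during adaptation

    def add_expert(self, init_gain=0.0):
        self.lifelong.append(init_gain)

    def _mix(self, gate_logits):
        gains = list(self.frozen) + self.lifelong
        if len(gate_logits) != len(gains):
            raise ValueError("need one gate logit per expert")
        gates = softmax(gate_logits)
        return gates, sum(g * w for g, w in zip(gates, gains))

    def forward(self, x, gate_logits):
        _, gain = self._mix(gate_logits)
        return [gain * v for v in x]

    def loss(self, x, gate_logits, target):
        y = self.forward(x, gate_logits)
        return 0.5 * sum((a - b) ** 2 for a, b in zip(y, target))

    def adapt_step(self, x, gate_logits, target, lr=0.1):
        """One gradient step on squared error; only lifelong gains move."""
        gates, gain = self._mix(gate_logits)
        err_dot_x = sum((gain * v - t) * v for v, t in zip(x, target))
        offset = len(self.frozen)
        for j in range(len(self.lifelong)):
            self.lifelong[j] -= lr * gates[offset + j] * err_dot_x

# Two frozen experts plus one new expert fitted to a new target gain of 3.
model = LifelongMoE([1.0, 2.0])
model.add_expert()
x, target, gate_logits = [1.0, 2.0], [3.0, 6.0], [0.0, 0.0, 0.0]
loss_before = model.loss(x, gate_logits, target)
for _ in range(200):
    model.adapt_step(x, gate_logits, target)
loss_after = model.loss(x, gate_logits, target)
```

Storing the frozen gains in a tuple makes the freezing explicit in the data structure; in a real network the same effect would normally come from excluding those parameters from the optimizer.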

If this is right

Superior performance on simulated lifelong benchmarks and real-world tasks compared with prior parameter-efficient or prompt-based baselines.
Lifelong adaptation occurs by adding a moderate number of trainable parameters rather than retraining the full model.
Inference overhead remains limited because only a subset of experts is active for any given motion.
Knowledge transfer improves when new experts interact dynamically with the frozen pre-trained set instead of operating in isolation.
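The parameter and overhead claims above reduce to simple bookkeeping, sketched here under assumptions: each expert is taken to be a two-layer MLP with biases, each expert adds one router column of size `d_model`, and all sizes passed in are placeholders rather than figures from the paper.

```python
def expert_params(d_model, d_hidden):
    """Parameter count of a two-layer MLP expert with biases (assumed shape)."""
    return 2 * d_model * d_hidden + d_hidden + d_model

def adaptation_overhead(d_model, d_hidden, n_frozen, n_new, backbone_params):
    """Return (trainable, total, trainable fraction) after adding n_new experts.

    Each expert is assumed to bring one router column of d_model weights.
    """
    per_expert = expert_params(d_model, d_hidden) + d_model
    trainable = n_new * per_expert
    total = backbone_params + (n_frozen + n_new) * per_expert
    return trainable, total, trainable / total

def active_fraction(n_active, n_total):
    """Share of experts evaluated for one input under sparse activation."""
    return n_active / n_total

# Placeholder sizes only: 3 frozen experts, 1 new expert, a 1000-parameter backbone.
trainable, total, fraction = adaptation_overhead(4, 8, 3, 1, 1000)
```

With these placeholder sizes the new expert accounts for 80 of 1320 parameters; the paper's actual counts would have to be read from its tables.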

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The motion-driven expert split may generalize to other sequential robot skills such as navigation or assembly if similar low-level primitives can be isolated.
Increasing the expert pool size during pre-training could support longer task sequences without raising interference, provided the activation rule stays motion-based.
The frozen-expert reuse pattern suggests that early training on diverse motion data creates modular components that later tasks can query selectively.

Load-bearing premise

That motion-based activation of heterogeneous experts during pre-training produces reusable skills that can be freely recombined with newly trained experts without catastrophic interference.

What would settle it

A new manipulation task where the adapted model shows large performance drops on previously learned tasks or requires far more than the reported number of additional parameters to match the claimed accuracy.
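One way to quantify "large performance drops on previously learned tasks" is the average-forgetting measure common in the continual-learning literature: for each earlier task, the best success rate reached before the final stage minus the success rate after it. The sketch below assumes a lower-triangular table `success[i][j]` (success on task j after training through task i); the function name and the numbers in the example are hypothetical.

```python
def average_forgetting(success):
    """Mean drop from each task's best earlier success rate to its final one.

    success[i][j] is the success rate on task j after training through
    task i, defined for j <= i (rows may be ragged).
    """
    n = len(success)
    if n < 2:
        return 0.0
    drops = []
    for j in range(n - 1):
        best_earlier = max(success[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - success[n - 1][j])
    return sum(drops) / len(drops)

# Hypothetical three-task sequence: task 0 degrades, task 1 holds steady.
table = [
    [0.9],
    [0.8, 0.7],
    [0.6, 0.7, 0.5],
]
forgetting = average_forgetting(table)
```

A value near zero across a long task sequence would support the frozen-expert premise; a large positive value would count against it.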

Figures

Figures reproduced from arXiv: 2606.26183 by Lin Wang, Zhihao Gu.

**Figure 1.** Motivation of our framework. Base tasks contain various short-term actions, and skills in them (highlighted in bold) can be adapted to complete new tasks. New tasks contain shared knowledge, and interaction between them is beneficial for continual adaptation.

…explores learning task-specific parameters by low-rank adaptation (LoRA) [21], and retrieves corresponding adapters. Although these approaches achie…

**Figure 2.** Overview of the proposed Lifelong Mixture of Dynamic Experts (LiMoDE). It is a two-stage learning scheme that consists of a Dynamic Mixture of Experts Structure (DyMoES) and a Lifelong Mixture of Experts Adaptation Mechanism (LiMoEAM). In the pretraining stage, DyMoES activates a varied number of heterogeneous experts conditioned on motion information to learn skills in short-term manipulations. In the li…

**Figure 3.** Experimental setup. Top: LIBERO. Bottom: Real world.

…experts, we adopt a lightweight router-decorrelation regularizer $\mathcal{L}_{\mathrm{RDR}}$ on the Gramian matrix of the gating network $W_{\mathrm{router}}$:

$$\mathcal{L}_{\mathrm{RDR}} = \frac{1}{2}\left\lVert W_{\mathrm{router}}^{\top} W_{\mathrm{router}} - I_{N_E} \right\rVert_2^2, \tag{5}$$

where $I_{N_E}$ is an identity matrix with dimension $N_E$. Note that orthogonality- and decorrelation-inspired routing regularizers have been explored in MoE to improve routing diversity and red…

**Figure 4.** Visualization of the visual-dynamics-conditioned threshold and expert selection mechanisms. (a) The visual-dynamics-conditioned threshold M across timesteps for task “Pick the spoon and stir water in the bowl”. The colored clusters denote the different manipulation phases. Each phase is also labeled with the average number of activated experts K, computed by averaging token-wise activations within that pha…
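The router-decorrelation regularizer quoted in the Figure 3 excerpt (Eq. 5) can be computed directly from the router weights. A minimal sketch, assuming the router matrix is stored as d rows by N_E columns (one column per expert) and reading the squared norm as the sum of squared entries; both choices are assumptions, since the excerpt does not spell them out.

```python
def router_decorrelation_loss(w_router):
    """0.5 * sum of squared entries of (W^T W - I).

    w_router is a list of d rows, each with N_E entries (one per expert).
    """
    d = len(w_router)
    n_experts = len(w_router[0])
    loss = 0.0
    for a in range(n_experts):
        for b in range(n_experts):
            # Entry (a, b) of the Gramian W^T W.
            gram = sum(w_router[k][a] * w_router[k][b] for k in range(d))
            target = 1.0 if a == b else 0.0
            loss += (gram - target) ** 2
    return 0.5 * loss

# Orthonormal expert columns incur no penalty; identical columns do.
orthonormal = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
collapsed = [[1.0, 1.0], [0.0, 0.0]]
loss_orthonormal = router_decorrelation_loss(orthonormal)
loss_collapsed = router_decorrelation_loss(collapsed)
```

The penalty is zero exactly when the expert columns are orthonormal, which is the sense in which it discourages experts from receiving correlated routing scores.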

Abstract

Building a generalist robot that can leverage prior knowledge for continuous task adaptation remains a significant challenge. Previous works alleviate the catastrophic forgetting problem by parameter-efficient fine-tuning for single-task adaptation. However, they fail to extract reusable skills and model the interaction with other skills effectively. Recent works try to address these issues by learning prompts. Differently, this paper presents an architectural perspective on the Lifelong Mixture of Dynamic Experts (LiMoDE), a novel two-stage learning scheme for lifelong robot manipulation. Specifically, a dynamic MoE structure is first proposed in the multi-task pre-training stage to learn prior knowledge, where a varied number of heterogeneous experts are activated based on the motion information to address different short-term manipulations. Subsequently, in the task adaptation stage, we design a lifelong MoE adaptation mechanism (LiMoEAM) that learns lifelong experts and dynamically combines them with frozen ones for new tasks, facilitating the knowledge transfer during adaptation. The proposed LiMoDE is evaluated on both the simulated lifelong learning benchmark and real-world tasks. Extensive experiments demonstrate its effectiveness in achieving superior performance and strong lifelong adaptation by introducing a moderate number of additional trainable parameters and inference overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LiMoDE sketches a two-stage dynamic MoE for lifelong robot manipulation, but the abstract supplies no numbers, baselines, or ablations, so the performance claims cannot be checked.

The main takeaway is that the paper puts forward a two-stage mixture-of-dynamic-experts scheme for lifelong robot manipulation: a pre-training stage that uses motion-conditioned dynamic routing to activate varying numbers of heterogeneous experts, followed by an adaptation stage that adds new lifelong experts while keeping the pre-trained ones frozen and combining them on the fly. That split is the concrete proposal.

It improves on prior single-task fine-tuning or prompt methods by trying to extract reusable skills through the expert structure rather than treating each task in isolation. The robotics-specific choice to condition routing on motion information is a reasonable domain touch, and the parameter-efficient addition of new experts is a standard way to limit forgetting.

The soft spot is the missing evidence for the central transfer assumption. The abstract states that the approach achieves superior performance and strong lifelong adaptation on simulated benchmarks and real tasks with only moderate extra parameters and inference cost, yet it gives no quantitative results, no listed baselines, no forgetting curves, and no ablation that isolates whether the pre-trained experts actually remain reusable and non-interfering once new ones are added. Without those checks the claim that the pre-training stage successfully extracts composable skills does not land.

This is for people already working on continual learning or MoE architectures in robotics who want to see how the two-stage split plays out in manipulation. A reader looking for a method with demonstrated gains will not find enough here yet. It deserves peer review because the problem matters and the architecture is coherent on its own terms; the experiments, if present and solid, would be worth referee time even if they require substantial revision.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes LiMoDE, a two-stage scheme for lifelong robot manipulation. It uses a dynamic MoE in multi-task pre-training to activate varied numbers of heterogeneous experts based on motion information for learning prior knowledge. Then, in adaptation, lifelong experts are learned and dynamically combined with frozen pre-trained ones. It claims superior performance and strong lifelong adaptation on simulated benchmarks and real-world tasks with moderate additional parameters and inference overhead.

Significance. If the empirical claims hold, the work could advance lifelong robotic learning by providing an architectural separation between skill extraction via dynamic expert activation and interference-free adaptation, potentially improving upon prompt-based or parameter-efficient fine-tuning methods in terms of reusability and efficiency.

major comments (3)

[Abstract] The claim of 'superior performance and strong lifelong adaptation' is asserted without any quantitative results, baselines, ablation studies, or metrics, so the central empirical claim cannot be evaluated from the manuscript.

[Pre-training stage / Task adaptation stage] The load-bearing assumption that dynamic MoE pre-training extracts reusable, non-interfering skills (via motion-conditioned expert activation) is stated but not isolated; no ablation (e.g., freezing vs. unfreezing pre-trained experts, per-expert attribution, or forgetting curves across task sequences) is described to confirm transfer without catastrophic interference.

[Evaluation] The manuscript states evaluation on 'simulated lifelong learning benchmark and real-world tasks' but supplies no tables, figures, or numbers showing parameter counts, inference overhead, or comparisons, undermining the efficiency and superiority assertions.

minor comments (1)

[Abstract] The parenthetical '(LiMoEAM)' appears once but is never referenced again; either define the acronym consistently or remove it.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to improve clarity and completeness.

Referee: [Abstract] The claim of 'superior performance and strong lifelong adaptation' is asserted without any quantitative results, baselines, ablation studies, or metrics, so the central empirical claim cannot be evaluated from the manuscript.

Authors: We agree the abstract would be strengthened by including key quantitative highlights. The full manuscript contains these results in the experiments section. We will revise the abstract to reference specific metrics (e.g., success rates and overhead comparisons) while remaining concise.

revision: yes

Referee: [Pre-training stage / Task adaptation stage] The load-bearing assumption that dynamic MoE pre-training extracts reusable, non-interfering skills (via motion-conditioned expert activation) is stated but not isolated; no ablation (e.g., freezing vs. unfreezing pre-trained experts, per-expert attribution, or forgetting curves across task sequences) is described to confirm transfer without catastrophic interference.

Authors: This observation is fair. The current text describes the mechanism but lacks explicit isolation experiments. We will add ablations on expert freezing, attribution, and forgetting curves in a new subsection of the revised manuscript.

revision: yes

Referee: [Evaluation] The manuscript states evaluation on 'simulated lifelong learning benchmark and real-world tasks' but supplies no tables, figures, or numbers showing parameter counts, inference overhead, or comparisons, undermining the efficiency and superiority assertions.

Authors: The manuscript does contain evaluation tables and figures with these metrics (performance, parameters, overhead). To prevent any oversight, we will add a consolidated summary table early in the evaluation section and ensure all comparisons are explicitly cross-referenced.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

The paper proposes a two-stage dynamic MoE architecture for lifelong manipulation (pre-training with motion-conditioned experts, then adaptation with frozen experts plus new lifelong experts) and supports its claims solely via experimental results on simulated benchmarks and real-world tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central performance claims are therefore independent of any internal reduction and rest on falsifiable external evaluation, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all claims rest on the unstated assumption that motion-based expert activation yields reusable skills.

Reference graph

Works this paper leans on

65 extracted references · 17 canonical work pages · 10 internal anchors

[1] A. Brohan et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv:2212.06817, 2022.

[2] B. Zitkovich et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in CoRL, 2023.

[3] K. Black et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv:2410.24164, 2024.

[4] Z. Gu, M. Yang, D. Zou, and D. Xu, “Learning diffusion policy from primitive skills for robot manipulation,” in AAAI, 2026.

[5] E. Jang, A. Irpan, M. Khansari et al., “Bc-z: Zero-shot task generalization with robotic imitation learning,” in CoRL, 2022.

[6] Y. Yao et al., “Think small, act big: Primitive prompt learning for lifelong robot manipulation,” in CVPR, 2025.

[7] K. Roy, A. Dissanayake, B. Tidd, and P. Moghadam, “M2distill: Multi-modal distillation for lifelong imitation learning,” in IEEE ICRA, 2025.

[8] D. Lee, D. Lee, T. Kwack, W. Choi, and H. Woo, “Policy compatible skill incremental learning via lazy learning interface,” in NeurIPS, 2025.

[9] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” in IEEE TPAMI, 2021.

[10] C. Gao, H. Gao, S. Guo et al., “Cril: Continual robot imitation learning via generative and prediction model,” in IEEE IROS, 2021.

[11] Y. Zhu et al., “Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation,” in IEEE RAL, 2022.

[12] B. Liu, Y. Zhu, C. Gao et al., “Libero: Benchmarking knowledge transfer for lifelong robot learning,” in NeurIPS, 2023.

[13] W. Wan et al., “Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery,” in IEEE ICRA, 2024.

[14] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Experience replay for continual learning,” in NeurIPS, 2019.

[15] J. Gao, A. Xie, T. Xiao, C. Finn, and D. Sadigh, “Efficient data collection for robotic manipulation via compositional generalization,” in RSS, 2024.

[16] Z. Li et al., “Learning without forgetting,” in IEEE TPAMI, 2017.

[17] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in ICML, 2017.

[18] Z. Liu, J. Zhang, K. Asadi et al., “Tail: Task-specific adapters for imitation learning with large pretrained models,” in ICML, 2024.

[19] T. Schmied, M. Hofmarcher, F. Paischer, R. Pascanu, and S. Hochreiter, “Learning to modulate pre-trained models in rl,” in NeurIPS, 2023.

[20] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” in ICLR, 2021.

[21] E. J. Hu, Y. Shen, P. Wallis et al., “Lora: Low-rank adaptation of large language models,” in ICLR, 2022.

[22] Y. Zhang, X. Wang, and D. Yang, “Continual sequence generation with adaptive compositional modules,” in ACL, 2022.

[23] W. Zhao, S. Wang, Y. Hu et al., “Sapt: A shared attention framework for parameter-efficient continual learning of large language models,” in ACL, 2024, pp. 11641–11661.

[24] S. Chen, C. Ge, Z. Tong et al., “Adaptformer: Adapting vision transformers for scalable visual recognition,” in NeurIPS, 2022.

[25] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022.

[26] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv:2104.08691, 2021.

[27] D. Lee, M. Yoo, W. K. Kim et al., “Incremental learning of retrievable skills for efficient continual task adaptation,” in NeurIPS, 2024.

[28] M. Hong, A. Liang, K. Kim et al., “Hand me the data: Fast robot adaptation via hand path retrieval,” arXiv:2505.20455, 2025.

[29] K. Lu, K. T. Ly, W. Hebberd et al., “Learning generalizable manipulation policy with adapter-based parameter fine-tuning,” in IROS, 2024.

[30] R. Zhu et al., “Efficient continual adaptation of pretrained robotic policy with online meta-learned adapters,” arXiv:2503.18684, 2025.

[31] D. Dai et al., “DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models,” arXiv:2401.06066, 2024.

[32] A. Liu et al., “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” arXiv:2405.04434, 2024.

[33] R. Csordás, K. Irie, J. Schmidhuber, C. Potts, and C. D. Manning, “Moeut: Mixture-of-experts universal transformers,” in NeurIPS, 2024.

[34] R. Csordás, P. Piękos, K. Irie et al., “Switchhead: Accelerating transformers with mixture-of-experts attention,” in NeurIPS, 2024.

[35] H. Nguyen, P. Akbarian, T. Pham et al., “Statistical advantages of perturbing cosine router in mixture of experts,” in ICLR, 2025.

[36] Q. Wu, Z. Ke, Y. Zhou et al., “Routing experts: Learning to route dynamic experts in multi-modal large language models,” in ICLR, 2025.

[37] J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,” in ICLR, 2024.

[38] Z. Liu et al., “Adamole: Fine-tuning large language models with adaptive mixture of low-rank adaptation experts,” arXiv:2405.00361, 2024.

[39] X. Wu, S. Huang, and F. Wei, “Mixture of lora experts,” in ICLR, 2024.

[40] J. Wu et al., “Omni-smola: Boosting generalist multimodal models with soft mixture of low-rank experts,” in CVPR, 2024.

[41] R. Huang, S. Zhu, Y. Du, and H. Zhao, “Moe-loco: Mixture of experts for multitask locomotion,” arXiv:2503.08564, 2025.

[42] Y. Wang, Y. Zhang, M. Huo et al., “Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning,” in CoRL, 2024.

[43] M. Reuss et al., “Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning,” in ICLR, 2025.

[44] W. Shen, Y. Liu, Y. Wu et al., “Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning,” arXiv:2510.14300, 2025.

[45] C. Chi, Z. Xu, S. Feng et al., “Diffusion policy: Visuomotor policy learning via action diffusion,” in IJRR, 2025.

[46] S. Niekum et al., “Learning grounded finite-state representations from unstructured demonstrations,” IJRR, 2015.

[47] U. A. Mishra, S. Xue, Y. Chen, and D. Xu, “Generative skill chaining: Long-horizon skill planning with diffusion models,” in CoRL, 2023.

[48] Z. Liang et al., “Skilldiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution,” in CVPR, 2024.

[49] M. Bain and C. Sammut, “A framework for behavioural cloning,” in Machine intelligence, 1995.

[50] A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.

[51] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in AAAI, 2018.

[52] N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv:1701.06538, 2017.

[53] D. Lepikhin et al., “GShard: Scaling giant models with conditional computation and automatic sharding,” arXiv:2006.16668, 2020.

[54] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.

[55] N. Omi, S. Sen, and A. Farhadi, “Load balancing mixture of experts with similarity preserving routers,” arXiv:2506.14038, 2025.

[56] H. Guo, H. Lu, G. Nan et al., “Advancing expert specialization for better moe,” NIPS, vol. 38, pp. 48767–48809, 2026.

[57] R. Hu et al., “Synergistic intra-and cross-layer regularization losses for moe expert specialization,” arXiv:2602.14159, 2026.

[58] A. Chaudhry, M. Rohrbach, M. Elhoseiny et al., “Continual learning with tiny episodic memories,” in Workshop on Multi-Task and Lifelong Reinforcement Learning, 2019.

[59] A. Xie and C. Finn, “Lifelong robotic reinforcement learning by retaining experiences,” in CoLLA. PMLR, 2022, pp. 838–855.

[60] M. J. Kim et al., “OpenVLA: An open-source vision-language-action model,” arXiv:2406.09246, 2024.

[61] J. Zheng, J. Li, D. Liu et al., “Universal actions for enhanced embodied foundation models,” in CVPR, 2025.

[62] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Learning to act anywhere with task-centric latent actions,” in RSS, 2025.

[63] Q. Li et al., “CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation,” arXiv:2411.19650, 2024.

[64] M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language-action models: Optimizing speed and success,” arXiv:2502.19645, 2025.

[65] J. Chen et al., “Florence-vl: Enhancing vision-language models with generative vision encoder and depth-breadth fusion,” in CVPR, 2025.

[1] [1]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohanet al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovichet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inCoRL, 2023

2023

[3] [3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Blacket al., “π0: A vision-language-action flow model for general robot control,”arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Learning diffusion policy from primitive skills for robot manipulation,

Z. Gu, M. Yang, D. Zou, and D. Xu, “Learning diffusion policy from primitive skills for robot manipulation,” inAAAI, 2026

2026

[5] [5]

Bc-z: Zero-shot task generaliza- tion with robotic imitation learning,

E. Jang, A. Irpan, M. Khansariet al., “Bc-z: Zero-shot task generaliza- tion with robotic imitation learning,” inCoRL, 2022

2022

[6] [6]

Think small, act big: Primitive prompt learning for lifelong robot manipulation,

Y . Yao and other, “Think small, act big: Primitive prompt learning for lifelong robot manipulation,” inCVPR, 2025

2025

[7] [7]

M2distill: Multi- modal distillation for lifelong imitation learning,

K. Roy, A. Dissanayakc, B. Tidd, and P. Moghadam, “M2distill: Multi- modal distillation for lifelong imitation learning,” inIEEE ICRA, 2025

2025

[8] [8]

Policy compatible skill incremental learning via lazy learning interface,

D. Lee, D. Lee, T. Kwack, W. Choi, and H. Woo, “Policy compatible skill incremental learning via lazy learning interface,” inNeurIPS, 2025

2025

[9] [9]

A continual learning survey: Defying forgetting in classification tasks,

M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,” inIEEE TPAMI, 2021

2021

[10] [10]

Cril: Continual robot imitation learning via generative and prediction model,

C. Gao, H. Gao, and S. e. a. Guo, “Cril: Continual robot imitation learning via generative and prediction model,” inIEEE IROS, 2021

2021

[11] Y. Zhu et al., “Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation,” in IEEE RAL, 2022

[12] B. Liu, Y. Zhu, C. Gao et al., “Libero: Benchmarking knowledge transfer for lifelong robot learning,” in NeurIPS, 2023

[13] W. Wan et al., “Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery,” in IEEE ICRA, 2024

[14] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Experience replay for continual learning,” in NeurIPS, 2019

[15] J. Gao, A. Xie, T. Xiao, C. Finn, and D. Sadigh, “Efficient data collection for robotic manipulation via compositional generalization,” in RSS, 2024

[16] Z. Li et al., “Learning without forgetting,” in IEEE TPAMI, 2017

[17] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in ICML, 2017

[18] Z. Liu, J. Zhang, K. Asadi et al., “Tail: Task-specific adapters for imitation learning with large pretrained models,” in ICML, 2024

[19] T. Schmied, M. Hofmarcher, F. Paischer, R. Pascanu, and S. Hochreiter, “Learning to modulate pre-trained models in rl,” in NeurIPS, 2023

[20] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” in ICLR, 2021

[21] E. J. Hu, Y. Shen, P. Wallis et al., “Lora: Low-rank adaptation of large language models,” in ICLR, 2022

[22] Y. Zhang, X. Wang, and D. Yang, “Continual sequence generation with adaptive compositional modules,” in ACL, 2022

[23] W. Zhao, S. Wang, Y. Hu et al., “Sapt: A shared attention framework for parameter-efficient continual learning of large language models,” in ACL, 2024, pp. 11 641–11 661

[24] S. Chen, C. Ge, Z. Tong et al., “Adaptformer: Adapting vision transformers for scalable visual recognition,” in NeurIPS, 2022

[25] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in ECCV, 2022

[26] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv:2104.08691, 2021

[27] D. Lee, M. Yoo, W. K. Kim et al., “Incremental learning of retrievable skills for efficient continual task adaptation,” in NeurIPS, 2024

[28] M. Hong, A. Liang, K. Kim et al., “Hand me the data: Fast robot adaptation via hand path retrieval,” arXiv:2505.20455, 2025

[29] K. Lu, K. T. Ly, W. Hebberd et al., “Learning generalizable manipulation policy with adapter-based parameter fine-tuning,” in IROS, 2024

[30] R. Zhu et al., “Efficient continual adaptation of pretrained robotic policy with online meta-learned adapters,” arXiv:2503.18684, 2025

[31] D. Dai et al., “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,” arXiv:2401.06066, 2024

[32] A. Liu et al., “Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model,” arXiv:2405.04434, 2024

[33] R. Csordás, K. Irie, J. Schmidhuber, C. Potts, and C. D. Manning, “Moeut: Mixture-of-experts universal transformers,” in NeurIPS, 2024

[34] R. Csordás, P. Piękos, K. Irie et al., “Switchhead: Accelerating transformers with mixture-of-experts attention,” in NeurIPS, 2024

[35] H. Nguyen, P. Akbarian, T. Pham et al., “Statistical advantages of perturbing cosine router in mixture of experts,” in ICLR, 2025

[36] Q. Wu, Z. Ke, Y. Zhou et al., “Routing experts: Learning to route dynamic experts in multi-modal large language models,” in ICLR, 2025

[37] J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,” in ICLR, 2024

[38] Z. Liu et al., “Adamole: Fine-tuning large language models with adaptive mixture of low-rank adaptation experts,” arXiv:2405.00361, 2024

[39] X. Wu, S. Huang, and F. Wei, “Mixture of lora experts,” in ICLR, 2024

[40] J. Wu et al., “Omni-smola: Boosting generalist multimodal models with soft mixture of low-rank experts,” in CVPR, 2024

[41] R. Huang, S. Zhu, Y. Du, and H. Zhao, “Moe-loco: Mixture of experts for multitask locomotion,” arXiv:2503.08564, 2025

[42] Y. Wang, Y. Zhang, M. Huo et al., “Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning,” in CoRL, 2024

[43] M. Reuss et al., “Efficient diffusion transformer policies with mixture of expert denoisers for multitask learning,” in ICLR, 2025

[44] W. Shen, Y. Liu, Y. Wu et al., “Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning,” arXiv:2510.14300, 2025

[45] C. Chi, Z. Xu, S. Feng et al., “Diffusion policy: Visuomotor policy learning via action diffusion,” in IJRR, 2025

[46] S. Niekum et al., “Learning grounded finite-state representations from unstructured demonstrations,” IJRR, 2015

[47] U. A. Mishra, S. Xue, Y. Chen, and D. Xu, “Generative skill chaining: Long-horizon skill planning with diffusion models,” in CoRL, 2023

[48] Z. Liang et al., “Skilldiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution,” in CVPR, 2024

[49] M. Bain and C. Sammut, “A framework for behavioural cloning,” in Machine intelligence, 1995

[50] A. Radford et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021

[51] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in AAAI, 2018

[52] N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv:1701.06538, 2017

[53] D. Lepikhin et al., “Gshard: Scaling giant models with conditional computation and automatic sharding,” arXiv:2006.16668, 2020

[54] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

[55] N. Omi, S. Sen, and A. Farhadi, “Load balancing mixture of experts with similarity preserving routers,” arXiv:2506.14038, 2025

[56] H. Guo, H. Lu, G. Nan et al., “Advancing expert specialization for better moe,” NeurIPS, vol. 38, pp. 48 767–48 809, 2026

[57] R. Hu et al., “Synergistic intra-and cross-layer regularization losses for moe expert specialization,” arXiv:2602.14159, 2026

[58] A. Chaudhry, M. Rohrbach, M. Elhoseiny et al., “Continual learning with tiny episodic memories,” in Workshop on Multi-Task and Lifelong Reinforcement Learning, 2019

[59] A. Xie and C. Finn, “Lifelong robotic reinforcement learning by retaining experiences,” in CoLLA. PMLR, 2022, pp. 838–855

[60] M. J. Kim et al., “Openvla: An open-source vision-language-action model,” arXiv:2406.09246, 2024

[61] J. Zheng, J. Li, D. Liu et al., “Universal actions for enhanced embodied foundation models,” in CVPR, 2025

[62] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Learning to act anywhere with task-centric latent actions,” in RSS, 2025

[63] Q. Li et al., “Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation,” arXiv:2411.19650, 2024

[64] M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language-action models: Optimizing speed and success,” arXiv:2502.19645, 2025

[65] J. Chen et al., “Florence-vl: Enhancing vision-language models with generative vision encoder and depth-breadth fusion,” in CVPR, 2025