pith. sign in

arxiv: 2605.17571 · v1 · pith:4NSPAS7Gnew · submitted 2026-05-17 · 💻 cs.CV · cs.LG

Stable Routing for Mixture-of-Experts in Class-Incremental Learning

Pith reviewed 2026-05-20 13:10 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords class-incremental learningmixture-of-expertsstable routingrouting alignmentcontinual learningexpert expansionknowledge preservationcatastrophic forgetting
0
0 comments X

The pith

Aligning old-class routing to historical distributions stabilizes expandable mixture-of-experts in class-incremental learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding new experts to a mixture-of-experts model during class-incremental learning causes the router to reassign old-class samples, disrupting previously learned expert compositions even when old experts remain frozen. StaR-MoE counters this by adding sensitivity-aware routing alignment that pulls current routing for old classes back toward their historical distributions and by applying asymmetric capacity regularization that promotes use of the new experts for new classes. If the approach works, models can keep growing their expert pool while retaining prior accuracy and gaining on new classes. A sympathetic reader cares because this isolates routing stability as a controllable factor that existing MoE-CIL methods have left unaddressed.

Core claim

Expandable mixture-of-experts architectures for class-incremental learning require two complementary properties: stable old-class routing to preserve knowledge and sufficient capacity utilization for new-class adaptation. StaR-MoE realizes these properties through sensitivity-aware routing alignment, which matches the router's present behavior on old classes to historical routing distributions via sensitivity-guided constraints, together with asymmetric capacity regularization that encourages effective use of the expanded expert pool without eroding class-specific specialization.

What carries the argument

Sensitivity-aware routing alignment, which enforces that the current router's assignments for old-class samples remain close to their historical distributions by means of sensitivity-guided constraints.

If this is right

  • StaR-MoE raises both average accuracy and last-task accuracy above prior state-of-the-art methods on four standard class-incremental learning benchmarks.
  • Old-class knowledge remains more intact because expert assignments for those classes change less after new experts are introduced.
  • New classes still receive adequate expert capacity because the regularization term remains asymmetric and does not force every sample onto old experts.
  • The framework demonstrates that routing-level interventions can be added on top of frozen-expert expandable MoE without redesigning the underlying architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensitivity-guided alignment idea could be tested in other continual settings where model components are added incrementally, such as expanding transformer layers or growing neural architecture search outputs.
  • If routing drift turns out to be the dominant interference mechanism, monitoring router sensitivity might serve as a general diagnostic for forgetting in any expert-based or modular continual-learning system.
  • The approach leaves open whether similar constraints could be applied at the feature level rather than the routing level when experts are not strictly additive.

Load-bearing premise

The main source of interference is routing drift from expert expansion, and constraining the router to match historical distributions for old classes will preserve knowledge without preventing effective learning of new classes.

What would settle it

On any standard CIL benchmark, measure the change in routing probability vectors for old-class samples before and after each expert addition; if StaR-MoE does not produce measurably smaller changes than an unconstrained MoE baseline, or if removing the alignment term eliminates the reported accuracy gains, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.17571 by Da-Wei Zhou, Lijun Zhang, Quan Cheng, Zirui Guo.

Figure 1
Figure 1. Figure 1: Illustration of routing drift in SEMA. Black lines mark expert expan￾sion; non-zero routing probabilities to the left of each marker indicate later experts activated for earlier tasks. We refer to this structural interference problem as routing drift, where expert expansion alters the computational path￾ways of learned classes. Routing drift therefore provides a structural explanation for forgetting in exp… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the StaR-MoE framework. prior work [Chen et al., 2022, He et al., 2022], we freeze the ViT backbone and insert lightweight adapters as experts in parallel with the MLP blocks. Let x l ∈ R d denote the input feature to the MLP block at the l-th layer. The adapter function Al (x l ) comprises a down-projection Wl down ∈ R d×r that maps features to a bottleneck dimension r, followed by a ReLU acti… view at source ↗
Figure 3
Figure 3. Figure 3: Performance curves of different methods on ImageNet-R, ImageNet-A, and VTAB. We [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Further analysis. (a) Task-wise routing probabilities in StaR-MoE on ImageNet-A. Low [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: t-SNE visualization of router-input distributions across MoE layers on VTAB. Samples [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Additional experimental results. (a) Running time comparison on CIFAR-100 and ImageNet [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Impact of MoE design choices. (a) Effect of the number of active experts [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes StaR-MoE, a routing-level framework for expandable mixture-of-experts models in class-incremental learning. It identifies routing drift from expert expansion as a source of interference and introduces two components: sensitivity-aware routing alignment, which constrains current old-class routing to match historical distributions, and asymmetric capacity regularization, which promotes utilization of the expanded expert pool. Experiments across four standard CIL benchmarks report consistent gains in both average and last accuracy over prior state-of-the-art methods.

Significance. If the empirical results hold under closer scrutiny, the work is significant for highlighting routing stability as a distinct and actionable requirement in MoE-based CIL. The paired design of alignment for preservation and asymmetric regularization for plasticity offers a practical balance that prior expandable-MoE approaches appear to have under-emphasized. The multi-benchmark evaluation provides direct support for the central claim that stable routing reduces interference without sacrificing adaptation.

major comments (1)
  1. [§3.2] §3.2 (sensitivity-aware routing alignment): the claim that aligning to historical routing distributions preserves knowledge without unduly restricting plasticity rests on the untested premise that historical router outputs for old classes remain near-optimal after expert expansion; an ablation that varies the alignment strength hyper-parameter and reports both old-class routing consistency (e.g., expert-assignment overlap) and new-class accuracy would directly test whether the constraint is load-bearing or merely additive.
minor comments (3)
  1. [Tables 1–2] Table 1 and Table 2: error bars or standard deviations across the reported runs are not shown; adding them would allow readers to judge whether the modest last-accuracy gains (typically 1–3 points) are statistically reliable.
  2. [§4.3] §4.3 (asymmetric capacity regularization): the precise form of the asymmetry (e.g., the weighting between old and new experts) is described only qualitatively; an explicit equation or pseudocode would improve reproducibility.
  3. [Figure 3] Figure 3: the visualization of routing distributions before and after alignment would benefit from a quantitative metric (e.g., KL divergence) in the caption to make the visual comparison more precise.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the specific suggestion to strengthen the analysis of sensitivity-aware routing alignment. We will incorporate the recommended ablation in the revised manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (sensitivity-aware routing alignment): the claim that aligning to historical routing distributions preserves knowledge without unduly restricting plasticity rests on the untested premise that historical router outputs for old classes remain near-optimal after expert expansion; an ablation that varies the alignment strength hyper-parameter and reports both old-class routing consistency (e.g., expert-assignment overlap) and new-class accuracy would directly test whether the constraint is load-bearing or merely additive.

    Authors: We agree that an explicit ablation varying the alignment strength would provide stronger empirical support for the claim that historical routing distributions remain near-optimal for old classes post-expansion. In the revision we will add a study that sweeps the alignment strength hyper-parameter, reporting old-class routing consistency (expert-assignment overlap) together with new-class accuracy. This will clarify whether the constraint is load-bearing for the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces StaR-MoE as a new routing-level framework that adds sensitivity-aware routing alignment (to match historical distributions) and asymmetric capacity regularization (to enable new-class adaptation). These are presented as complementary design choices motivated by the problem of routing drift, not as re-derivations or fits of prior quantities. The abstract and described properties contain no equations that reduce predictions to fitted inputs by construction, nor load-bearing self-citations that substitute for independent justification. Empirical gains across four CIL benchmarks are reported as external validation rather than tautological outcomes. The derivation remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The two new mechanisms (sensitivity-aware routing alignment and asymmetric capacity regularization) are introduced as design choices whose precise formulations and any hidden hyperparameters remain unspecified.

pith-pipeline@v0.9.0 · 5766 in / 1298 out tokens · 77050 ms · 2026-05-20T13:10:06.344817+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Proceedings of the

    Deep Residual Learning for Image Recognition , author =. Proceedings of the

  2. [2]

    An Image is Worth 16x16 Words:

    Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil , booktitle =. An Image is Worth 16x16 Words:

  3. [3]

    Psychology of Learning and Motivation , volume =

    Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem , author =. Psychology of Learning and Motivation , volume =

  4. [4]

    2024 , publisher =

    Class-Incremental Learning: A Survey , author =. 2024 , publisher =

  5. [5]

    2024 , publisher =

    A Comprehensive Survey of Continual Learning: Theory, Method and Application , author =. 2024 , publisher =

  6. [6]

    2022 , publisher =

    Class-Incremental Learning: Survey and Performance Evaluation on Image Classification , author =. 2022 , publisher =

  7. [7]

    Nature Machine Intelligence , volume =

    Three Types of Incremental Learning , author =. Nature Machine Intelligence , volume =. 2022 , publisher =

  8. [8]

    Proceedings of the

    Learning to Prompt for Continual Learning , author =. Proceedings of the

  9. [9]

    Wang, Zifeng and Zhang, Zizhao and Ebrahimi, Sayna and Sun, Ruoxi and Zhang, Han and Lee, Chen-Yu and Ren, Xiaoqi and Su, Guolong and Perot, Vincent and Dy, Jennifer and Pfister, Tomas , booktitle =

  10. [10]

    Dynamic Mixture of Curriculum

    Ge, Chendi and Wang, Xin and Zhang, Zeyang and Chen, Hong and Fan, Jiapei and Huang, Longtao and Xue, Hui and Zhu, Wenwu , booktitle =. Dynamic Mixture of Curriculum

  11. [11]

    Proceedings of the

    Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning , author =. Proceedings of the

  12. [12]

    S-Prompts Learning with Pre-Trained Transformers: An

    Wang, Yabin and Huang, Zhiwu and Hong, Xiaopeng , booktitle =. S-Prompts Learning with Pre-Trained Transformers: An

  13. [13]

    Liang, Yan-Shuo and Li, Wu-Jun , booktitle =

  14. [14]

    Wu, Yichen and Piao, Hongming and Huang, Long-Kai and Wang, Renzhen and Li, Wanhua and Pfister, Hanspeter and Meng, Deyu and Ma, Kede and Wei, Ying , booktitle =

  15. [15]

    and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

  16. [16]

    Theory on

    Li, Hongbo and Lin, Sen and Duan, Lingjie and Liang, Yingbin and Shroff, Ness , booktitle =. Theory on

  17. [17]

    Smith, James Seale and Karlinsky, Leonid and Gutta, Vyshnavi and Cascante-Bonilla, Paola and Kim, Donghyun and Arbelle, Assaf and Panda, Rameswar and Feris, Rogerio and Kira, Zsolt , booktitle =

  18. [18]

    Deep Learning Using Rectified Linear Units (

    Agarap, Abien Fred , journal =. Deep Learning Using Rectified Linear Units (

  19. [19]

    International Conference on Learning Representations , year =

    Towards a Unified View of Parameter-Efficient Transfer Learning , author =. International Conference on Learning Representations , year =

  20. [20]

    Proceedings of the

    Expert Gate: Lifelong Learning with a Network of Experts , author =. Proceedings of the

  21. [21]

    Advances in Neural Information Processing Systems 33 , pages =

    Dark Experience for General Continual Learning: A Strong, Simple Baseline , author =. Advances in Neural Information Processing Systems 33 , pages =

  22. [22]

    2017 , publisher =

    Learning without Forgetting , author =. 2017 , publisher =

  23. [23]

    Proceedings of the

    Douillard, Arthur and Ram. Proceedings of the

  24. [24]

    Proceedings of the

    Task-Agnostic Guided Feature Expansion for Class-Incremental Learning , author =. Proceedings of the

  25. [25]

    International Conference on Learning Representations , year =

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author =. International Conference on Learning Representations , year =

  26. [26]

    He, Jiangpeng and Duan, Zhihao and Zhu, Fengqing , booktitle =

  27. [27]

    Learning Multiple Layers of Features from Tiny Images , author =

  28. [28]

    Proceedings of the

    The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization , author =. Proceedings of the

  29. [29]

    Proceedings of the

    Natural Adversarial Examples , author =. Proceedings of the

  30. [30]

    Proceedings of the

    Moment Matching for Multi-Source Domain Adaptation , author =. Proceedings of the

  31. [31]

    International Journal of Computer Vision , volume =

    Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity Are All You Need , author =. International Journal of Computer Vision , volume =. 2025 , publisher =

  32. [32]

    Proceedings of the

    Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning , author =. Proceedings of the

  33. [33]

    Proceedings of the

    Emerging Properties in Self-Supervised Vision Transformers , author =. Proceedings of the

  34. [34]

    , booktitle =

    Rebuffi, Sylvestre-Alvise and Kolesnikov, Alexander and Sperl, Georg and Lampert, Christoph H. , booktitle =

  35. [35]

    A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

    A large-scale study of representation learning with the visual task adaptation benchmark , author=. arXiv preprint arXiv:1910.04867 , year=

  36. [36]

    Proceedings of the European Conference on Computer Vision , pages =

    Memory Aware Synapses: Learning What (Not) to Forget , author =. Proceedings of the European Conference on Computer Vision , pages =

  37. [37]

    Proceedings of the National Academy of Sciences , volume=

    Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the National Academy of Sciences , volume=. 2017 , publisher=

  38. [38]

    Proceedings of the 34th International Conference on Machine Learning , pages=

    Continual learning through synaptic intelligence , author=. Proceedings of the 34th International Conference on Machine Learning , pages=

  39. [39]

    Proceedings of the 42nd International Conference on Machine Learning , year =

    Addressing Imbalanced Domain-Incremental Learning through Dual-Balance Collaborative Experts , author =. Proceedings of the 42nd International Conference on Machine Learning , year =

  40. [40]

    Advances in Neural Information Processing Systems 36 , pages =

    Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-Optimality , author =. Advances in Neural Information Processing Systems 36 , pages =

  41. [41]

    Zhang, Gengwei and Wang, Liyuan and Kang, Guoliang and Chen, Ling and Wei, Yunchao , booktitle =

  42. [42]

    Proceedings of the

    Dual-Teacher Class-Incremental Learning with Data-Free Generative Replay , author =. Proceedings of the

  43. [43]

    Proceedings of the

    Adaptive Plasticity Improvement for Continual Learning , author =. Proceedings of the

  44. [44]

    Proceedings of the European Conference on Computer Vision , pages =

    Visual Prompt Tuning , author =. Proceedings of the European Conference on Computer Vision , pages =

  45. [45]

    Proceedings of the

    Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters , author =. Proceedings of the

  46. [46]

    International Conference on Learning Representations , year =

    Lifelong Learning with Dynamically Expandable Networks , author =. International Conference on Learning Representations , year =

  47. [47]

    Proceedings of the 36th International Conference on Machine Learning , pages=

    Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting , author=. Proceedings of the 36th International Conference on Machine Learning , pages=

  48. [48]

    2021 , publisher =

    Sokar, Ghada and Mocanu, Decebal Constantin and Pechenizkiy, Mykola , journal =. 2021 , publisher =

  49. [49]

    Advances in Neural Information Processing Systems 30 , pages =

    Gradient Episodic Memory for Continual Learning , author =. Advances in Neural Information Processing Systems 30 , pages =

  50. [50]

    Advances in Neural Information Processing Systems 35 , pages =

    Exploring Example Influence in Continual Learning , author =. Advances in Neural Information Processing Systems 35 , pages =

  51. [51]

    Trends in Cognitive Sciences , volume =

    Catastrophic Forgetting in Connectionist Networks , author =. Trends in Cognitive Sciences , volume =. 1999 , publisher =

  52. [52]

    Psychological Review , volume =

    Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights from the Successes and Failures of Connectionist Models of Learning and Memory , author =. Psychological Review , volume =. 1995 , publisher =

  53. [53]

    Class-Incremental Learning with

    Huang, Linlan and Cao, Xusheng and Lu, Haori and Liu, Xialei , booktitle =. Class-Incremental Learning with

  54. [54]

    Guo, Xiaohan and Cai, Yusong and Liu, Zejia and Wang, Zhengning and Pan, Lili and Li, Hongliang , journal =

  55. [55]

    and Ba, Jimmy , journal=

    Kingma, Diederik P. and Ba, Jimmy , journal=

  56. [56]

    Chen, Shoufa and Ge, Chongjian and Tong, Zhan and Wang, Jiangliu and Song, Yibing and Wang, Jue and Luo, Ping , booktitle=

  57. [57]

    , booktitle=

    Nair, Vinod and Hinton, Geoffrey E. , booktitle=. Rectified linear units improve

  58. [58]

    Journal of Machine Learning Research , volume =

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , author =. Journal of Machine Learning Research , volume =

  59. [59]

    Huai, Tianyu and Zhou, Jie and Wu, Xingjiao and Chen, Qin and Bai, Qingchun and Zhou, Ze and He, Liang , booktitle =

  60. [60]

    Advances in Neural Information Processing Systems 37 , pages =

    Mixture of Experts Meets Prompt-Based Continual Learning , author =. Advances in Neural Information Processing Systems 37 , pages =

  61. [61]

    Advances in Neural Information Processing Systems 13 , pages =

    Incorporating Second-Order Functional Knowledge for Better Option Pricing , author =. Advances in Neural Information Processing Systems 13 , pages =

  62. [62]

    Transactions on Machine Learning Research , issn=

    Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA , author=. Transactions on Machine Learning Research , issn=

  63. [63]

    Sun, Hai-Long and Zhou, Da-Wei and Ye, Han-Jia and Zhan, De-Chuan , journal =

  64. [64]

    International Conference on Learning Representations , year =

    Divide and Not Forget: Ensemble of Selectively Trained Experts in Continual Learning , author =. International Conference on Learning Representations , year =

  65. [65]

    Sun, Hai-Long and Zhou, Da-Wei and Zhao, Hanbin and Gan, Le and Zhan, De-Chuan and Ye, Han-Jia , booktitle=

  66. [66]

    International Conference on Learning Representations , year =

    One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-Based Continual Learning , author =. International Conference on Learning Representations , year =

  67. [67]

    Proceedings of the

    Prototype augmentation and self-supervision for incremental learning , author=. Proceedings of the

  68. [68]

    Visualizing data using t-

    Van der Maaten, Laurens and Hinton, Geoffrey , journal=. Visualizing data using t-