pith · machine review for the scientific record

arxiv: 2605.09355 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

Rama Chellappa, Shravan Chaudhari, Suchi Saria, Tanvi Ranade, Xing Han

Pith reviewed 2026-05-12 04:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture-of-experts · continual learning · multimodal learning · multi-task learning · catastrophic forgetting · parameter efficiency · healthcare applications

The pith

FLAME enables multimodal models to pretrain on multiple tasks jointly and then adapt to new tasks with unseen modality combinations by expanding only lightweight routers while compressing expert knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a mixture-of-experts framework that addresses both multi-task pretraining, where related tasks can share representations when available together, and continual learning, where new tasks arrive later with different modality mixes such as images plus text or sensors. It achieves this through modality-specific routers that handle tokens from each data type independently and by folding accumulated expert parameters into low-rank memory subspaces rather than adding full new experts. A reader would care because real deployments start with an incomplete task set yet still need the efficiency gains from joint training, and standard models either forget prior skills or grow inefficiently large. The design keeps total capacity nearly fixed after initial pretraining while supporting flexible modality combinations. Experiments on healthcare multimodal benchmarks indicate it matches joint-training accuracy, reduces forgetting on earlier tasks, and uses fewer added parameters than baselines.

Core claim

The FLAME framework supports training on multimodal tasks with diverse modality configurations through modality-specific routers that process tokens from each modality across tasks. It enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces while expanding only the lightweight routers. On healthcare multimodal benchmarks, this yields competitive multi-task pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

What carries the argument

Modality-specific routers in a sparse Mixture-of-Experts model paired with low-rank compression of accumulated expert parameters, allowing modular capacity growth without full model retraining or interference.
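As a concrete picture of the mechanism, here is a minimal routing sketch in plain Python. It is an editorial illustration, not the paper's implementation: the dimensions, the random router weights, and the top-1 scoring rule are all assumptions, and FLAME's actual gating function and expert count may differ.

```python
import random

random.seed(0)

D, E = 4, 3          # token dimension, number of shared experts
MODALITIES = ["image", "text", "sensor"]

# One lightweight router (an E x D score matrix) per modality; the expert
# pool itself is shared across modalities. Weights are random placeholders.
routers = {m: [[random.uniform(-1, 1) for _ in range(D)] for _ in range(E)]
           for m in MODALITIES}

def route(token, modality):
    """Top-1 dispatch: score the token with its modality's own router
    and return the index of the highest-scoring shared expert."""
    scores = [sum(wi * xi for wi, xi in zip(row, token))
              for row in routers[modality]]
    return max(range(E), key=lambda e: scores[e])

# A batch mixing modality combinations, as in the Flexi-Modal setting.
batch = [("image",  [0.2, -1.0, 0.5, 0.1]),
         ("text",   [1.0, 0.3, -0.2, 0.0]),
         ("sensor", [0.0, 0.0, 1.0, -1.0])]

assignments = {m: route(tok, m) for m, tok in batch}
print(assignments)  # each token lands on exactly one shared expert
```

In this picture, continual adaptation means adding or updating a small entry in `routers` for a new task's modality mix while the shared expert pool stays frozen, which is why only lightweight parameters grow.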

If this is right

  • Models can jointly train on co-available multimodal tasks to borrow representational strength across related objectives.
  • New tasks arriving sequentially can be incorporated by expanding only the routers while keeping the expert pool fixed.
  • Parameter count grows sublinearly because expert knowledge is compressed rather than duplicated or expanded fully.
  • The same architecture maintains competitive accuracy on multiple healthcare multimodal benchmarks across both pretraining and adaptation phases.
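The parameter-growth bullet can be checked with back-of-envelope accounting. All dimensions below are invented for illustration; only the comparison pattern (expanding routers only, versus duplicating a full expert per new task) mirrors the paper's argument.

```python
d_model, d_ff = 512, 2048      # illustrative transformer dimensions
n_experts = 8                  # fixed expert pool size

router_params = n_experts * d_model   # one linear router over the pool
expert_params = 2 * d_model * d_ff    # fc1 + fc2 per expert

def total_params(n_tasks, strategy):
    """Total MoE-layer parameters after n_tasks sequential tasks."""
    base = n_experts * expert_params + router_params
    if strategy == "grow_experts":    # add one full expert per new task
        return base + (n_tasks - 1) * expert_params
    if strategy == "grow_routers":    # FLAME-style: add only a router
        return base + (n_tasks - 1) * router_params

for t in (1, 5, 10):
    print(t, total_params(t, "grow_experts"), total_params(t, "grow_routers"))
```

With these illustrative sizes, ten tasks roughly double the layer under expert duplication but add well under 1% under router-only expansion.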

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-plus-compression pattern could be applied to non-healthcare domains such as robotics or autonomous systems where sensor modalities evolve over time.
  • If low-rank subspaces reliably capture task knowledge, then expert modules in other modular architectures may also admit efficient lifelong compression without explicit redesign.
  • Testing the framework on single-modality continual learning would isolate whether modality-specific routers are necessary or whether generic routers suffice.
  • This points toward lifelong-learning systems in which total parameter growth remains bounded even as the number of encountered tasks increases without limit.

Load-bearing premise

Compressing accumulated expert knowledge into low-rank memory subspaces across sequential tasks preserves performance without substantial degradation or interference between modalities.
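The premise can be probed in miniature with a truncated SVD, assuming (as a stand-in for whatever factorization the paper actually uses) that an accumulated expert weight matrix concentrates its energy in a few directions. The matrix below is synthetic, and NumPy is used for the decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Synthetic "accumulated expert" weight: low-rank signal plus small noise,
# standing in for knowledge concentrated in a few directions.
W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
W += 0.01 * rng.normal(size=(d, d))

U, s, Vt = np.linalg.svd(W, full_matrices=False)

def rel_error(r):
    """Relative Frobenius error of the best rank-r approximation."""
    W_r = (U[:, :r] * s[:r]) @ Vt[:r]
    return float(np.linalg.norm(W - W_r) / np.linalg.norm(W))

for r in (1, 4, 16):
    print(r, round(rel_error(r), 4))
```

Storing the rank-r factors costs 2dr parameters instead of d², which is where the efficiency gain would come from; the bet the premise makes is that real expert weights decay this fast without burying cross-modal interactions in the discarded tail.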

What would settle it

A substantial rise in error rates on earlier tasks after adding new tasks with previously unseen modality combinations, even after applying the low-rank compression, would show that the method fails to alleviate catastrophic forgetting.
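That settling condition can be made operational with the standard average-forgetting metric: for each task, the drop from its best accuracy at any stage since it was introduced to its accuracy after the final stage. The accuracy matrix below is fabricated solely to show the computation; a large value on real checkpoints would be the failure signal described above.

```python
def average_forgetting(acc):
    """acc[i][j] = accuracy on task j after training stage i
    (task j is introduced at stage j, so earlier entries are unused).
    Forgetting for task j: best accuracy since introduction minus final
    accuracy, averaged over all tasks except the last-introduced one."""
    T = len(acc)
    drops = [max(acc[i][j] for i in range(j, T)) - acc[T - 1][j]
             for j in range(T - 1)]
    return sum(drops) / len(drops)

# Illustrative numbers only: three sequential tasks.
acc = [[0.80, 0.00, 0.00],
       [0.78, 0.82, 0.00],
       [0.77, 0.80, 0.85]]
print(round(average_forgetting(acc), 3))  # → 0.025
```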

Figures

Figures reproduced from arXiv: 2605.09355 by Rama Chellappa, Shravan Chaudhari, Suchi Saria, Tanvi Ranade, Xing Han.

Figure 1: An illustration of the Flexi-Modal multi-task setting. Each task is associated with an …
Figure 2: FLAME multi-task pretraining architecture. Flexi-Modal tasks with overlapping but …
Figure 3: Cumulative top-K energy across all 10 expert sublayers (5 experts × {fc1, fc2}) of the first cross-modal MoE block, after FLAME multi-task pretraining on all 9 healthcare tasks. Left: input-covariance spectrum {λ_j} of C_i = E_z[zz⊤], computed over inputs the router actually dispatches to each expert. Center: weight-only Frobenius spectrum {σ²_{i,k}} of W_i. Right: data-aware functional energy E_{i,k} = σ²_{i,k}·…
Figure 4: Overview of FLAME's CL procedure for MoE, with improved parameter efficiency. … so the forward pass for task t excludes any component reserved at a later stage. The per-stage rank r_t is a hyperparameter, fixing the cumulative reserved rank Σ_t r_t and turning capacity planning into an explicit choice under a fixed-size MoE. The full set of trainable and frozen parameters at stage t, together with the per-s…
Figure 5: Per-stage continual-learning trajectories across four task sequences (columns: Setup 1–4) …
Figure 6: Pairwise task performance. Left: AUROC; right: AUPRC. Each row represents the focal …
Figure 7: Ablation over number of experts. Left: AUROC; right: AUPRC. FLAME jointly trained …
Figure 5: Among the standard CL baselines, LoRA is the only one that retains earlier-stage AUROC by freezing the base, so the relevant question is whether FLAME-CL can match LoRA's retention at a smaller per-stage parameter cost.
Figure 8: Comparison against Lifelong-PT [6], an MoE-based continual-learning method that grows the expert pool with each new task. Per-stage trajectories for the four sequences (columns: Setup 1 to Setup 4) across three metrics (rows: AUROC, encoder parameter count in millions, MoE parameter count in millions). x-axis: training-stage checkpoint; within each stage, AUROC panels overlay one line per task seen so far. …
Figure 9: Comparison against LoRA [22], the strongest retention baseline among the standard CL methods. Per-stage trajectories for the four sequences (columns: Setup 1 to Setup 4) across three metrics (rows: AUROC, encoder parameter count in millions, MoE parameter count in millions). x-axis: training-stage checkpoint; within each stage, AUROC panels overlay one line per task seen so far. FLAME-CL (red) matches LoRA …
Figure 10: Per-stage AUPRC for the four continual-learning sequences (S1–S4); axes and grouping as in …
Figure 11: Expert input spectra after single-task FLAME pretraining on the three MIMIC-IV tasks …
Figure 12: Expert input spectra after single-task FLAME pretraining on the two eICU tasks.
Figure 13: Expert input spectra after single-task FLAME pretraining on the three EMBED breast …
Figure 14: Expert input spectra after single-task FLAME pretraining on ADNI (multi-modal tabular …
Figure 15: Expert input spectra under joint FLAME pretraining. Despite the larger and more …
Figure 16: Per-modality routing for the three MIMIC-IV tasks. Modalities: chest X-ray ( …
Figure 17: Per-modality routing for the two eICU tasks.
Figure 18: Per-modality routing for the three EMBED breast-imaging tasks. Mammography views …
Figure 19: Per-modality routing for ADNI Alzheimer's diagnosis (T1–T5 imaging biomarkers).
Figure 20: Per-task, per-modality routing under joint MIMIC-IV pretraining over {IHM, LOS, …
Figure 21: Per-task, per-modality routing under joint pretraining over all 9 tasks across MIMIC-IV, …
Figure 22: S1: per-modality, per-stage routing under MIMIC-IV continual learning.
Figure 23: S2: per-modality, per-stage routing under eICU continual learning.
Figure 24: S3: per-modality, per-stage routing under EMBED continual learning.
Figure 25: S4: per-modality, per-stage routing under the mixed cross-dataset continual se…
Original abstract

Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, tasks are co-available at design time where related tasks could borrow representational strength from one another, (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FLAME, a sparse Mixture-of-Experts framework for multimodal multi-task learning that supports both joint pretraining on co-available tasks and continual adaptation to new tasks with previously unseen modality combinations. Modality-specific routers process tokens from each modality, while accumulated expert knowledge is compressed into low-rank memory subspaces and only lightweight routers are expanded to maintain fixed model capacity. The approach is claimed to achieve competitive multitask pretraining performance, alleviate catastrophic forgetting, and improve parameter efficiency on multiple healthcare multimodal benchmarks.

Significance. If the low-rank compression successfully retains task- and modality-specific information without substantial cross-modal interference or performance degradation, the framework would offer a scalable solution for real-world deployment of multimodal models in dynamic domains such as healthcare. The decoupling of modality-level computation via routers and the fixed-capacity continual learning mechanism address a practical gap between exhaustive joint pretraining and naive sequential adaptation. However, the absence of any quantitative metrics, baselines, ablation studies, or validation of the low-rank approximation in the provided abstract makes it difficult to determine whether the central claims are supported.

major comments (2)
  1. [Abstract] The central claims of 'competitive multitask pretraining performance' and 'alleviating catastrophic forgetting' are stated without any reported metrics (e.g., task accuracies or forgetting rates such as average accuracy drop across tasks), baselines (standard MoE, fine-tuning, or replay methods), or ablation results on the low-rank compression. This leaves the effectiveness of compressing expert knowledge into low-rank subspaces unsupported by visible evidence.
  2. [Abstract] No details are provided on the low-rank compression implementation, including the rank-selection criterion, reconstruction error, or modality-specific interference metrics. This is load-bearing for the claim that performance is preserved across sequential tasks with varying modality combinations, since discarded cross-modal interactions could cause degradation even with sparse routing.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly naming the specific healthcare multimodal benchmarks used and the number of sequential tasks in the continual learning experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical value of FLAME in addressing continual multimodal multi-task learning. We agree that the abstract would be strengthened by incorporating key quantitative results and a brief description of the low-rank compression mechanism. We have revised the abstract to address these points and provide point-by-point responses below.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'competitive multitask pretraining performance' and 'alleviating catastrophic forgetting' are stated without any reported metrics (e.g., task accuracies or forgetting rates such as average accuracy drop across tasks), baselines (standard MoE, fine-tuning, or replay methods), or ablation results on the low-rank compression. This leaves the effectiveness of compressing expert knowledge into low-rank subspaces unsupported by visible evidence.

    Authors: We agree that the abstract would benefit from including representative quantitative evidence. The full manuscript reports experimental results on healthcare multimodal benchmarks, including comparisons against standard MoE, fine-tuning, and replay baselines, along with ablation studies on the low-rank compression and metrics such as task accuracies and forgetting rates (average accuracy drop across tasks). To make this evidence visible in the abstract, we have revised it to summarize key performance figures demonstrating competitive multitask pretraining and reduced catastrophic forgetting. revision: yes

  2. Referee: [Abstract] No details are provided on the low-rank compression implementation, including the rank-selection criterion, reconstruction error, or modality-specific interference metrics. This is load-bearing for the claim that performance is preserved across sequential tasks with varying modality combinations, since discarded cross-modal interactions could cause degradation even with sparse routing.

    Authors: The full manuscript provides these implementation details in the Methods section, including the low-rank factorization approach, rank selection via explained variance threshold, reported reconstruction errors, and analysis of modality-specific interference to confirm limited cross-modal degradation. We acknowledge that a concise reference in the abstract would better address potential skeptic concerns. We have therefore revised the abstract to include a brief description of the low-rank compression process and its role in preserving performance across sequential tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural design choices validated on external benchmarks

full rationale

The paper proposes FLAME as a new MoE architecture for multimodal continual learning, specifying design elements such as modality-specific routers and low-rank compression of expert knowledge into memory subspaces. These are presented as engineering decisions to support multitask pretraining and sequential adaptation, not as outputs derived from equations or parameters that are fitted and then renamed as predictions. The abstract and description contain no self-referential definitions, no fitted-input predictions, and no load-bearing self-citations that reduce the central claims to tautologies. Performance is assessed via empirical results on independent healthcare multimodal benchmarks, keeping the derivation chain self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available; beyond the proposed framework itself, it details no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5540 in / 1010 out tokens · 27785 ms · 2026-05-12T04:09:37.051715+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 6 internal anchors

  1. [1]

    arXiv preprint arXiv:2410.11222 , year=

    Pedram Akbarian, Huy Nguyen, Xing Han, and Nhat Ho. Quadratic gating mixture of experts: Statistical insights into self-attention.arXiv preprint arXiv:2410.11222, 2024

  2. [2]

    Implicit regularization in deep matrix factorization.Advances in neural information processing systems, 32, 2019

    Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization.Advances in neural information processing systems, 32, 2019

  3. [3]

    Neural networks as kernel learners: The silent alignment effect

    Alexander Atanasov, Blake Bordelon, and Cengiz Pehlevan. Neural networks as kernel learners: The silent alignment effect. InInternational Conference on Learning Representations, 2022

  4. [4]

    Multimodal machine learn- ing: A survey and taxonomy.IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018

    Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learn- ing: A survey and taxonomy.IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018

  5. [5]

    Multitask learning.Machine learning, 28(1):41–75, 1997

    Rich Caruana. Multitask learning.Machine learning, 28(1):41–75, 1997

  6. [6]

    Lifelong language pretraining with distribution-specialized experts

    Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. Lifelong language pretraining with distribution-specialized experts. InInternational Conference on Machine Learning, pages 5383–5395. PMLR, 2023

  7. [7]

    Chandler Davis and William M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970

  8. [8]

    A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366– 3385, 2021

    Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366– 3385, 2021

  9. [9]

    The mnist database of handwritten digit images for machine learning research [best of the web].IEEE Signal Processing Magazine, 29(6):141–142, 2012

    Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web].IEEE Signal Processing Magazine, 29(6):141–142, 2012

  10. [10]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  11. [11]

    Efficiently identifying task groupings for multi-task learning.Advances in Neural Information Processing Systems, 34:27503–27516, 2021

    Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning.Advances in Neural Information Processing Systems, 34:27503–27516, 2021

  12. [12]

    Implicit regularization of discrete gradient dynamics in linear neural networks

    Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. InAdvances in Neural Information Processing Systems, 2019

  13. [13]

    Mimic-iv-ecg: Diagnostic electrocardiogram matched subset.Type: dataset, 2023

    Brian Gow, Tom Pollard, Larry A Nathanson, Alistair Johnson, Benjamin Moody, Chrystinne Fernandes, Nathaniel Greenbaum, Jonathan W Waks, Parastou Eslami, Tanner Carbonati, et al. Mimic-iv-ecg: Diagnostic electrocardiogram matched subset.Type: dataset, 2023

  14. [14]

    Implicit bias of gradient descent on linear convolutional networks.Advances in neural information processing systems, 31, 2018

    Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks.Advances in neural information processing systems, 31, 2018

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

  16. [16]

    Sparsely activated mixture-of-experts are robust multi-task learners.arXiv preprint arXiv:2204.07689, 2022

    Shashank Gupta, Subhabrata Mukherjee, Krishan Subudhi, Eduardo Gonzalez, Damien Jose, Ahmed H Awadallah, and Jianfeng Gao. Sparsely activated mixture-of-experts are robust multi-task learners.arXiv preprint arXiv:2204.07689, 2022

  17. [17]

    Guiding mixture-of-experts with temporal multimodal interactions.arXiv preprint arXiv:2509.25678, 2025

    Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, and Suchi Saria. Guiding mixture-of-experts with temporal multimodal interactions.arXiv preprint arXiv:2509.25678, 2025

  18. [18]

    Dynamic combination of heterogeneous models for hierarchical time series

    Xing Han, Jing Hu, and Joydeep Ghosh. Dynamic combination of heterogeneous models for hierarchical time series. In2022 IEEE International Conference on Data Mining Workshops (ICDMW), pages 1207–1216. IEEE, 2022

  19. [19]

    Fusemoe: Mixture-of-experts transformers for fleximodal fusion.Advances in Neural Information Processing Systems, 37:67850–67900, 2024

    Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, and Suchi Saria. Fusemoe: Mixture-of-experts transformers for fleximodal fusion.Advances in Neural Information Processing Systems, 37:67850–67900, 2024

  20. [20]

    Multitask learning and benchmarking with clinical time series data.Scientific data, 6(1):96, 2019

    Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data.Scientific data, 6(1):96, 2019

  21. [21]

    Hendawy, J

    Ahmed Hendawy, Jan Peters, and Carlo D’Eramo. Multi-task reinforcement learning with mixture of orthogonal experts.arXiv preprint arXiv:2311.11385, 2023

  22. [22]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  23. [23]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  24. [24]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. InAdvances in Neural Information Processing Systems, 2018

  25. [25]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. InInternational conference on machine learning, pages 4651–4664. PMLR, 2021

  26. [26]

    The EMory BrEast imaging dataset (EMBED): A racially diverse, granular dataset of 3.4 million screening and diagnostic mammographic images.Radiol

    Jiwoong J Jeong, Brianna L Vey, Ananth Bhimireddy, Thomas Kim, Thiago Santos, Ramon Correa, Raman Dutt, Marina Mosunjac, Gabriela Oprea-Ilies, Geoffrey Smith, Minjae Woo, Christopher R McAdams, Mary S Newell, Imon Banerjee, Judy Gichoya, and Hari Trivedi. The EMory BrEast imaging dataset (EMBED): A racially diverse, granular dataset of 3.4 million screeni...

  27. [27]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  28. [28]

    Mimic-cxr database.PhysioNet10, 13026(C2JT1Q):5, 2024

    Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr database.PhysioNet10, 13026(C2JT1Q):5, 2024

  29. [29]

    Mimic-iv, a freely accessible electronic health record dataset.Scientific data, 10(1):1, 2023

    Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. Mimic-iv, a freely accessible electronic health record dataset.Scientific data, 10(1):1, 2023

  30. [30]

    The universal weight subspace hypothesis, 2025

    Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, and Alan Yuille. The universal weight subspace hypothesis, 2025

  31. [31]

    Eigenlorax: Recy- cling adapters to find principal subspaces for resource-efficient adaptation and inference

    Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, and Alan Yuille. Eigenlorax: Recy- cling adapters to find principal subspaces for resource-efficient adaptation and inference. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 649–659, 2025. 11

  32. [32]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  33. [33]

    Concentration inequalities and moment bounds for sample covariance operators.Bernoulli, 23(1):110–133, 2017

    Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators.Bernoulli, 23(1):110–133, 2017

  34. [34]

    Continual learning for domain adaptation in chest x-ray classification

    Matthias Lenga, Heinrich Schulz, and Axel Saalbach. Continual learning for domain adaptation in chest x-ray classification. InMedical Imaging with Deep Learning, pages 413–423. PMLR, 2020

  35. [35]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  36. [36]

    Theory on mixture-of- experts in continual learning.arXiv preprint arXiv:2406.16437, 2024

    Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, and Ness B Shroff. Theory on mixture-of- experts in continual learning.arXiv preprint arXiv:2406.16437, 2024

  37. [37]

    High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, and Ruslan Salakhutdinov. High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning. arXiv preprint arXiv:2203.01311, 2022

  38. [38]

    Inflora: Interference-free low-rank adaptation for continual learning

    Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  39. [39]

    Med-flamingo: a multimodal medical few-shot learner (2023).URL: https://arxiv

    Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, and Jure Leskovec. Med-flamingo: a multimodal medical few-shot learner (2023).URL: https://arxiv. org/abs/2307.15189, 2023

  40. [40]

    Multi- modal contrastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

    Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multi- modal contrastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

  41. [41]

    Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y Ng, et al. Multimodal deep learning. In ICML, volume 11, pages 689–696, 2011.

  42. [42]

    Huy Nguyen, Xing Han, Carl Harris, Suchi Saria, and Nhat Ho. On expert estimation in hierarchical mixture of experts: Beyond softmax gating functions. arXiv preprint arXiv:2410.02935, 2024.

  43. [43]

    Tom Pollard, Alistair Johnson, Jesse Raffa, Leo Anthony Celi, Omar Badawi, and Roger Mark. eICU Collaborative Research Database. PhysioNet, April 2019. Version 2.0.

  44. [44]

    Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ Digital Medicine, 3(1):119, 2020.

  45. [45]

    Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

  46. [46]

    Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

  47. [47]

    Grzegorz Rypeść, Sebastian Cygert, Valeriya Khan, Tomasz Trzciński, Bartosz Zieliński, and Bartłomiej Twardowski. Divide and not forget: Ensemble of selectively trained experts in continual learning. arXiv preprint arXiv:2401.10191, 2024.

  48. [48]

    Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021.

  49. [49]

    Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

  50. [50]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

  51. [51]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  52. [52]

    Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1405–1414, 2017.

  53. [53]

    Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1–2):1–230, 2015.

  54. [54]

    Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021.

  55. [55]

    Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

  56. [56]

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.

  57. [57]

    Qi Wang, Liang Zhan, Paul Thompson, and Jiayu Zhou. Multimodal learning with incomplete modalities by knowledge distillation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1828–1838, 2020.

  58. [58]

    Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.

  59. [59]

    Michael W Weiner, Dallas P Veitch, Paul S Aisen, Laurel A Beckett, Nigel J Cairns, Robert C Green, Danielle Harvey, Clifford R Jack, Jr, William Jagust, John C Morris, Ronald C Petersen, Jennifer Salazar, Andrew J Saykin, Leslie M Shaw, Arthur W Toga, John Q Trojanowski, and Alzheimer’s Disease Neuroimaging Initiative. The Alzheimer’s Disease Neuroimaging Initiative 3: Continued innovation for clinical trial improvement. Alzheimers Dement, 13(5):561–571, December 2016.

  60. [60]

    Chenwei Wu, Zitao Shuai, Zhengxu Tang, Luning Wang, and Liyue Shen. Dynamic modeling of patients, modalities and tasks via multi-modal multi-task mixture of experts. In The Thirteenth International Conference on Learning Representations, 2025.

  61. [61]

    Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023.

  62. [62]

    Shu Yang, Muhammad Asif Ali, Cheng-Long Wang, Lijie Hu, and Di Wang. MoRAL: MoE augmented LoRA for LLMs’ lifelong learning. arXiv preprint arXiv:2402.11260, 2024.

  63. [63]

    Jun Yu, Yutong Dai, Xiaokang Liu, Jin Huang, Yishan Shen, Ke Zhang, Rong Zhou, Eashan Adhikarla, Wenxuan Ye, Yixin Liu, et al. Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras. arXiv preprint arXiv:2404.18961, 2024.

  64. [64]

    Yi Yu, Tengyao Wang, and Richard J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.

  65. [65]

    Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-MoE: Modeling arbitrary modality combination via the flexible mixture-of-experts. Advances in Neural Information Processing Systems, 37:98782–98805, 2024.

  66. [66]

    Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.

Supplementary Material for “FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning”

Appendix Contents

  • A. Extended Related Works
  • B. Proof of Proposition 1
  • C. Dataset Details
  • D. Additional Experime...

    MoRAL [62] couples MoE with LoRA, using low-rank adapters as experts so that lifelong adaptation of LLMs can be achieved with minimal trainable parameters. SEED [47] tackles class-incremental learning by maintaining a fixed-size expert ensemble and selectively fine-tuning only the single expert whose distributions overlap least with the new task. Despite these advan...
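    The MoE-with-LoRA-experts idea above can be sketched as follows. This is an illustrative toy, not the implementation of MoRAL [62] or FLAME: a frozen base projection plus several low-rank adapter "experts" mixed by a per-token softmax gate. All names (`LoRAExpertLayer`, `rank`, `n_experts`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRAExpertLayer:
    """Frozen linear layer whose experts are LoRA pairs, mixed by a softmax gate."""

    def __init__(self, d_in, d_out, n_experts=4, rank=2):
        self.W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen base weight
        # Each expert e contributes a low-rank update A_e B_e; only these
        # (and the gate) would be trainable during lifelong adaptation.
        self.A = rng.standard_normal((n_experts, d_in, rank)) * 0.02
        self.B = np.zeros((n_experts, rank, d_out))          # zero-init => no-op at start
        self.gate = rng.standard_normal((d_in, n_experts)) * 0.02

    def __call__(self, x):
        # Per-token softmax routing over experts.
        logits = x @ self.gate
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        base = x @ self.W
        # Per-token, per-expert low-rank update: x A_e B_e.
        updates = np.einsum('td,edr,ero->teo', x, self.A, self.B)
        # Gate-weighted mixture of expert updates added to the frozen base.
        return base + np.einsum('te,teo->to', probs, updates)

layer = LoRAExpertLayer(d_in=8, d_out=8)
x = rng.standard_normal((5, 8))  # 5 tokens
y = layer(x)
# With B zero-initialized, the layer reduces to the frozen base projection.
assert np.allclose(y, x @ layer.W)
```

The zero-initialized `B` matrices mirror standard LoRA practice: new experts start as identity-preserving no-ops, so adding adapters for a new task cannot perturb previously learned behavior before training begins.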