pith · machine review for the scientific record

arxiv: 2605.09355 · v1 · submitted 2026-05-10 · 💻 cs.LG

Recognition: 2 theorem links

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

Rama Chellappa, Shravan Chaudhari, Suchi Saria, Tanvi Ranade, Xing Han

Pith reviewed 2026-05-12 04:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture-of-experts · continual learning · multimodal learning · multi-task learning · catastrophic forgetting · parameter efficiency · healthcare applications

The pith

FLAME enables multimodal models to pretrain on multiple tasks jointly and then adapt to new tasks with unseen modality combinations by expanding only lightweight routers while compressing expert knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a mixture-of-experts framework that addresses both multi-task pretraining, where related tasks can share representations when available together, and continual learning, where new tasks arrive later with different modality mixes such as images plus text or sensors. It achieves this through modality-specific routers that handle tokens from each data type independently and by folding accumulated expert parameters into low-rank memory subspaces rather than adding full new experts. A reader would care because real deployments start with an incomplete task set yet still need the efficiency gains from joint training, and standard models either forget prior skills or grow inefficiently large. The design keeps total capacity nearly fixed after initial pretraining while supporting flexible modality combinations. Experiments on healthcare multimodal benchmarks indicate it matches joint-training accuracy, reduces forgetting on earlier tasks, and uses fewer added parameters than baselines.

Core claim

The FLAME framework supports training on multimodal tasks with diverse modality configurations through modality-specific routers that process tokens from each modality across tasks. It enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces while expanding only the lightweight routers. On healthcare multimodal benchmarks, this yields competitive multi-task pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

What carries the argument

Modality-specific routers in a sparse Mixture-of-Experts model paired with low-rank compression of accumulated expert parameters, allowing modular capacity growth without full model retraining or interference.
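As a concrete picture of the mechanism, here is a minimal routing sketch in plain Python. It is an editorial illustration, not the paper's implementation: the dimensions, the random router weights, and the top-1 scoring rule are all assumptions, and FLAME's actual gating function and expert count may differ.

```python
import random

random.seed(0)

D, E = 4, 3          # token dimension, number of shared experts
MODALITIES = ["image", "text", "sensor"]

# One lightweight router (an E x D score matrix) per modality; the expert
# pool itself is shared across modalities. Weights are random placeholders.
routers = {m: [[random.uniform(-1, 1) for _ in range(D)] for _ in range(E)]
           for m in MODALITIES}

def route(token, modality):
    """Top-1 dispatch: score the token with its modality's own router
    and return the index of the highest-scoring shared expert."""
    scores = [sum(wi * xi for wi, xi in zip(row, token))
              for row in routers[modality]]
    return max(range(E), key=lambda e: scores[e])

# A batch mixing modality combinations, as in the Flexi-Modal setting.
batch = [("image",  [0.2, -1.0, 0.5, 0.1]),
         ("text",   [1.0, 0.3, -0.2, 0.0]),
         ("sensor", [0.0, 0.0, 1.0, -1.0])]

assignments = {m: route(tok, m) for m, tok in batch}
print(assignments)  # each token lands on exactly one shared expert
```

In this picture, continual adaptation means adding or updating a small entry in `routers` for a new task's modality mix while the shared expert pool stays frozen, which is why only lightweight parameters grow.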

If this is right

  • Models can jointly train on co-available multimodal tasks to borrow representational strength across related objectives.
  • New tasks arriving sequentially can be incorporated by expanding only the routers while keeping the expert pool fixed.
  • Parameter count grows sublinearly because expert knowledge is compressed rather than duplicated or expanded fully.
  • The same architecture maintains competitive accuracy on multiple healthcare multimodal benchmarks across both pretraining and adaptation phases.
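The parameter-growth bullet can be checked with back-of-envelope accounting. All dimensions below are invented for illustration; only the comparison pattern (expanding routers only, versus duplicating a full expert per new task) mirrors the paper's argument.

```python
d_model, d_ff = 512, 2048      # illustrative transformer dimensions
n_experts = 8                  # fixed expert pool size

router_params = n_experts * d_model   # one linear router over the pool
expert_params = 2 * d_model * d_ff    # fc1 + fc2 per expert

def total_params(n_tasks, strategy):
    """Total MoE-layer parameters after n_tasks sequential tasks."""
    base = n_experts * expert_params + router_params
    if strategy == "grow_experts":    # add one full expert per new task
        return base + (n_tasks - 1) * expert_params
    if strategy == "grow_routers":    # FLAME-style: add only a router
        return base + (n_tasks - 1) * router_params

for t in (1, 5, 10):
    print(t, total_params(t, "grow_experts"), total_params(t, "grow_routers"))
```

With these illustrative sizes, ten tasks roughly double the layer under expert duplication but add well under 1% under router-only expansion.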

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-plus-compression pattern could be applied to non-healthcare domains such as robotics or autonomous systems where sensor modalities evolve over time.
  • If low-rank subspaces reliably capture task knowledge, then expert modules in other modular architectures may also admit efficient lifelong compression without explicit redesign.
  • Testing the framework on single-modality continual learning would isolate whether modality-specific routers are necessary or whether generic routers suffice.
  • This points toward lifelong-learning systems in which total parameter growth remains bounded even as the number of encountered tasks increases without limit.

Load-bearing premise

Compressing accumulated expert knowledge into low-rank memory subspaces across sequential tasks preserves performance without substantial degradation or interference between modalities.
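The premise can be probed in miniature with a truncated SVD, assuming (as a stand-in for whatever factorization the paper actually uses) that an accumulated expert weight matrix concentrates its energy in a few directions. The matrix below is synthetic, and NumPy is used for the decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Synthetic "accumulated expert" weight: low-rank signal plus small noise,
# standing in for knowledge concentrated in a few directions.
W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
W += 0.01 * rng.normal(size=(d, d))

U, s, Vt = np.linalg.svd(W, full_matrices=False)

def rel_error(r):
    """Relative Frobenius error of the best rank-r approximation."""
    W_r = (U[:, :r] * s[:r]) @ Vt[:r]
    return float(np.linalg.norm(W - W_r) / np.linalg.norm(W))

for r in (1, 4, 16):
    print(r, round(rel_error(r), 4))
```

Storing the rank-r factors costs 2dr parameters instead of d², which is where the efficiency gain would come from; the bet the premise makes is that real expert weights decay this fast without burying cross-modal interactions in the discarded tail.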

What would settle it

A substantial rise in error rates on earlier tasks after adding new tasks with previously unseen modality combinations, even after applying the low-rank compression, would show that the method fails to alleviate catastrophic forgetting.
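That settling condition can be made operational with the standard average-forgetting metric: for each task, the drop from its best accuracy at any stage since it was introduced to its accuracy after the final stage. The accuracy matrix below is fabricated solely to show the computation; a large value on real checkpoints would be the failure signal described above.

```python
def average_forgetting(acc):
    """acc[i][j] = accuracy on task j after training stage i
    (task j is introduced at stage j, so earlier entries are unused).
    Forgetting for task j: best accuracy since introduction minus final
    accuracy, averaged over all tasks except the last-introduced one."""
    T = len(acc)
    drops = [max(acc[i][j] for i in range(j, T)) - acc[T - 1][j]
             for j in range(T - 1)]
    return sum(drops) / len(drops)

# Illustrative numbers only: three sequential tasks.
acc = [[0.80, 0.00, 0.00],
       [0.78, 0.82, 0.00],
       [0.77, 0.80, 0.85]]
print(round(average_forgetting(acc), 3))  # → 0.025
```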

Figures

Figures reproduced from arXiv: 2605.09355 by Rama Chellappa, Shravan Chaudhari, Suchi Saria, Tanvi Ranade, Xing Han.

Figure 1: An illustration of the Flexi-Modal multi-task setting. Each task is associated with an …
Figure 2: FLAME multi-task pretraining architecture. Flexi-Modal tasks with overlapping but …
Figure 3: Cumulative top-K energy across all 10 expert sublayers (5 experts × {fc1, fc2}) of the first cross-modal MoE block, after FLAME multi-task pretraining on all 9 healthcare tasks. Left: input-covariance spectrum {λ_j} of C_i = E_z[zz⊤], computed over inputs the router actually dispatches to each expert. Center: weight-only Frobenius spectrum {σ²_{i,k}} of W_i. Right: data-aware functional energy E_{i,k} = σ²_{i,k}·…
Figure 4: Overview of FLAME's CL procedure for MoE, with improved parameter efficiency. … so the forward pass for task t excludes any component reserved at a later stage. The per-stage rank r_t is a hyperparameter, fixing the cumulative reserved rank Σ_t r_t and turning capacity planning into an explicit choice under a fixed-size MoE. The full set of trainable and frozen parameters at stage t, together with the per-s…
Figure 5: Per-stage continual-learning trajectories across four task sequences (columns: Setup 1–4) …
Figure 6: Pairwise task performance. Left: AUROC; right: AUPRC. Each row represents the focal …
Figure 7: Ablation over number of experts. Left: AUROC; right: AUPRC. FLAME jointly trained …
Figure 5: Among the standard CL baselines, LoRA is the only one that retains earlier-stage AUROC by freezing the base, so the relevant question is whether FLAME-CL can match LoRA's retention at a smaller per-stage parameter cost.
Figure 8: Comparison against Lifelong-PT [6], an MoE-based continual-learning method that grows the expert pool with each new task. Per-stage trajectories for the four sequences (columns: Setup 1 to Setup 4) across three metrics (rows: AUROC, encoder parameter count in millions, MoE parameter count in millions). x-axis: training-stage checkpoint; within each stage, AUROC panels overlay one line per task seen so far. …
Figure 9: Comparison against LoRA [22], the strongest retention baseline among the standard CL methods. Per-stage trajectories for the four sequences (columns: Setup 1 to Setup 4) across three metrics (rows: AUROC, encoder parameter count in millions, MoE parameter count in millions). x-axis: training-stage checkpoint; within each stage, AUROC panels overlay one line per task seen so far. FLAME-CL (red) matches LoRA …
Figure 10: Per-stage AUPRC for the four continual-learning sequences (S1–S4); axes and grouping as in …
Figure 11: Expert input spectra after single-task FLAME pretraining on the three MIMIC-IV tasks …
Figure 12: Expert input spectra after single-task FLAME pretraining on the two eICU tasks.
Figure 13: Expert input spectra after single-task FLAME pretraining on the three EMBED breast …
Figure 14: Expert input spectra after single-task FLAME pretraining on ADNI (multi-modal tabular …
Figure 15: Expert input spectra under joint FLAME pretraining. Despite the larger and more …
Figure 16: Per-modality routing for the three MIMIC-IV tasks. Modalities: chest X-ray ( …
Figure 17: Per-modality routing for the two eICU tasks.
Figure 18: Per-modality routing for the three EMBED breast-imaging tasks. Mammography views …
Figure 19: Per-modality routing for ADNI Alzheimer's diagnosis (T1–T5 imaging biomarkers).
Figure 20: Per-task, per-modality routing under joint MIMIC-IV pretraining over {IHM, LOS, …
Figure 21: Per-task, per-modality routing under joint pretraining over all 9 tasks across MIMIC-IV, …
Figure 22: S1: per-modality, per-stage routing under MIMIC-IV continual learning.
Figure 23: S2: per-modality, per-stage routing under eICU continual learning.
Figure 24: S3: per-modality, per-stage routing under EMBED continual learning.
Figure 25: S4: per-modality, per-stage routing under the mixed cross-dataset continual se…
Original abstract

Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, tasks are co-available at design time where related tasks could borrow representational strength from one another, (2) continual adaptation, in which new tasks emerge after deployment with previously unseen modality combinations. However, neither regime alone suffices: the pretraining task set is never exhaustive, while bypassing joint training forfeits the transfer gains and efficiency among co-trainable tasks. Sparse Mixture-of-Experts (MoE) is a natural fit for this dual requirement: sparse activation enables modular capacity expansion as new tasks arrive, while routing decouples modality-level computation from task-level composition. In this work, we propose a scalable MoE framework for multitask pretraining and continual learning across flexible modality combinations. The framework is designed to support training on multimodal tasks with diverse modality configurations by leveraging modality-specific routers that process tokens from each modality across tasks. Furthermore, it enables continual learning over sequential multimodal tasks within a fixed-capacity MoE by compressing accumulated expert knowledge into low-rank memory subspaces, while expanding only the lightweight routers. We validate the effectiveness of our method on multiple healthcare multimodal benchmarks. It demonstrates competitive multitask pretraining performance while alleviating catastrophic forgetting and improving parameter efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FLAME, a sparse Mixture-of-Experts framework for multimodal multi-task learning that supports both joint pretraining on co-available tasks and continual adaptation to new tasks with previously unseen modality combinations. Modality-specific routers process tokens from each modality, while accumulated expert knowledge is compressed into low-rank memory subspaces and only lightweight routers are expanded to maintain fixed model capacity. The approach is claimed to achieve competitive multitask pretraining performance, alleviate catastrophic forgetting, and improve parameter efficiency on multiple healthcare multimodal benchmarks.

Significance. If the low-rank compression successfully retains task- and modality-specific information without substantial cross-modal interference or performance degradation, the framework would offer a scalable solution for real-world deployment of multimodal models in dynamic domains such as healthcare. The decoupling of modality-level computation via routers and the fixed-capacity continual learning mechanism address a practical gap between exhaustive joint pretraining and naive sequential adaptation. However, the absence of any quantitative metrics, baselines, ablation studies, or validation of the low-rank approximation in the provided abstract makes it difficult to determine whether the central claims are supported.

major comments (2)
  1. [Abstract] The central claims of 'competitive multitask pretraining performance' and 'alleviating catastrophic forgetting' are stated without any reported metrics (e.g., task accuracies or forgetting rates such as average accuracy drop across tasks), baselines (standard MoE, fine-tuning, or replay methods), or ablation results on the low-rank compression. This leaves the effectiveness of compressing expert knowledge into low-rank subspaces unsupported by visible evidence.
  2. [Abstract] No details are provided on the low-rank compression implementation, including the rank-selection criterion, reconstruction error, or modality-specific interference metrics. This is load-bearing for the claim that performance is preserved across sequential tasks with varying modality combinations, since discarded cross-modal interactions could cause degradation even with sparse routing.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly naming the specific healthcare multimodal benchmarks used and the number of sequential tasks in the continual learning experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical value of FLAME in addressing continual multimodal multi-task learning. We agree that the abstract would be strengthened by incorporating key quantitative results and a brief description of the low-rank compression mechanism. We have revised the abstract to address these points and provide point-by-point responses below.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'competitive multitask pretraining performance' and 'alleviating catastrophic forgetting' are stated without any reported metrics (e.g., task accuracies or forgetting rates such as average accuracy drop across tasks), baselines (standard MoE, fine-tuning, or replay methods), or ablation results on the low-rank compression. This leaves the effectiveness of compressing expert knowledge into low-rank subspaces unsupported by visible evidence.

    Authors: We agree that the abstract would benefit from including representative quantitative evidence. The full manuscript reports experimental results on healthcare multimodal benchmarks, including comparisons against standard MoE, fine-tuning, and replay baselines, along with ablation studies on the low-rank compression and metrics such as task accuracies and forgetting rates (average accuracy drop across tasks). To make this evidence visible in the abstract, we have revised it to summarize key performance figures demonstrating competitive multitask pretraining and reduced catastrophic forgetting. revision: yes

  2. Referee: [Abstract] No details are provided on the low-rank compression implementation, including the rank-selection criterion, reconstruction error, or modality-specific interference metrics. This is load-bearing for the claim that performance is preserved across sequential tasks with varying modality combinations, since discarded cross-modal interactions could cause degradation even with sparse routing.

    Authors: The full manuscript provides these implementation details in the Methods section, including the low-rank factorization approach, rank selection via explained variance threshold, reported reconstruction errors, and analysis of modality-specific interference to confirm limited cross-modal degradation. We acknowledge that a concise reference in the abstract would better address potential skeptic concerns. We have therefore revised the abstract to include a brief description of the low-rank compression process and its role in preserving performance across sequential tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural design choices validated on external benchmarks

full rationale

The paper proposes FLAME as a new MoE architecture for multimodal continual learning, specifying design elements such as modality-specific routers and low-rank compression of expert knowledge into memory subspaces. These are presented as engineering decisions to support multitask pretraining and sequential adaptation, not as outputs derived from equations or parameters that are fitted and then renamed as predictions. The abstract and description contain no self-referential definitions, no fitted-input predictions, and no load-bearing self-citations that reduce the central claims to tautologies. Performance is assessed via empirical results on independent healthcare multimodal benchmarks, keeping the derivation chain self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available; beyond the proposed framework itself, it details no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5540 in / 1010 out tokens · 27785 ms · 2026-05-12T04:09:37.051715+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 6 internal anchors

  1. [1]

    arXiv preprint arXiv:2410.11222 , year=

    Pedram Akbarian, Huy Nguyen, Xing Han, and Nhat Ho. Quadratic gating mixture of experts: Statistical insights into self-attention.arXiv preprint arXiv:2410.11222, 2024

  2. [2]

    Implicit regularization in deep matrix factorization.Advances in neural information processing systems, 32, 2019

    Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization.Advances in neural information processing systems, 32, 2019

  3. [3]

    Neural networks as kernel learners: The silent alignment effect

    Alexander Atanasov, Blake Bordelon, and Cengiz Pehlevan. Neural networks as kernel learners: The silent alignment effect. InInternational Conference on Learning Representations, 2022

  4. [4]

    Multimodal machine learn- ing: A survey and taxonomy.IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018

    Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learn- ing: A survey and taxonomy.IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018

  5. [5]

    Multitask learning.Machine learning, 28(1):41–75, 1997

    Rich Caruana. Multitask learning.Machine learning, 28(1):41–75, 1997

  6. [6]

    Lifelong language pretraining with distribution-specialized experts

    Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. Lifelong language pretraining with distribution-specialized experts. InInternational Conference on Machine Learning, pages 5383–5395. PMLR, 2023

  7. [7]

    Chandler Davis and William M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970

  8. [8]

    A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366– 3385, 2021

    Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366– 3385, 2021

  9. [9]

    The mnist database of handwritten digit images for machine learning research [best of the web].IEEE Signal Processing Magazine, 29(6):141–142, 2012

    Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web].IEEE Signal Processing Magazine, 29(6):141–142, 2012

  10. [10]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

  11. [11]

    Efficiently identifying task groupings for multi-task learning.Advances in Neural Information Processing Systems, 34:27503–27516, 2021

    Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning.Advances in Neural Information Processing Systems, 34:27503–27516, 2021

  12. [12]

    Implicit regularization of discrete gradient dynamics in linear neural networks

    Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. InAdvances in Neural Information Processing Systems, 2019

  13. [13]

    Mimic-iv-ecg: Diagnostic electrocardiogram matched subset.Type: dataset, 2023

    Brian Gow, Tom Pollard, Larry A Nathanson, Alistair Johnson, Benjamin Moody, Chrystinne Fernandes, Nathaniel Greenbaum, Jonathan W Waks, Parastou Eslami, Tanner Carbonati, et al. Mimic-iv-ecg: Diagnostic electrocardiogram matched subset.Type: dataset, 2023

  14. [14]

    Implicit bias of gradient descent on linear convolutional networks.Advances in neural information processing systems, 31, 2018

    Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks.Advances in neural information processing systems, 31, 2018

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

  16. [16]

    Sparsely activated mixture-of-experts are robust multi-task learners.arXiv preprint arXiv:2204.07689, 2022

    Shashank Gupta, Subhabrata Mukherjee, Krishan Subudhi, Eduardo Gonzalez, Damien Jose, Ahmed H Awadallah, and Jianfeng Gao. Sparsely activated mixture-of-experts are robust multi-task learners.arXiv preprint arXiv:2204.07689, 2022

  17. [17]

    Guiding mixture-of-experts with temporal multimodal interactions.arXiv preprint arXiv:2509.25678, 2025

    Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, and Suchi Saria. Guiding mixture-of-experts with temporal multimodal interactions.arXiv preprint arXiv:2509.25678, 2025

  18. [18]

    Dynamic combination of heterogeneous models for hierarchical time series

    Xing Han, Jing Hu, and Joydeep Ghosh. Dynamic combination of heterogeneous models for hierarchical time series. In2022 IEEE International Conference on Data Mining Workshops (ICDMW), pages 1207–1216. IEEE, 2022

  19. [19]

    Fusemoe: Mixture-of-experts transformers for fleximodal fusion.Advances in Neural Information Processing Systems, 37:67850–67900, 2024

    Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, and Suchi Saria. Fusemoe: Mixture-of-experts transformers for fleximodal fusion.Advances in Neural Information Processing Systems, 37:67850–67900, 2024

  20. [20]

    Multitask learning and benchmarking with clinical time series data.Scientific data, 6(1):96, 2019

    Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data.Scientific data, 6(1):96, 2019

  21. [21]

    Hendawy, J

    Ahmed Hendawy, Jan Peters, and Carlo D’Eramo. Multi-task reinforcement learning with mixture of orthogonal experts.arXiv preprint arXiv:2311.11385, 2023

  22. [22]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  23. [23]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  24. [24]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. InAdvances in Neural Information Processing Systems, 2018

  25. [25]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. InInternational conference on machine learning, pages 4651–4664. PMLR, 2021

  26. [26]

    The EMory BrEast imaging dataset (EMBED): A racially diverse, granular dataset of 3.4 million screening and diagnostic mammographic images.Radiol

    Jiwoong J Jeong, Brianna L Vey, Ananth Bhimireddy, Thomas Kim, Thiago Santos, Ramon Correa, Raman Dutt, Marina Mosunjac, Gabriela Oprea-Ilies, Geoffrey Smith, Minjae Woo, Christopher R McAdams, Mary S Newell, Imon Banerjee, Judy Gichoya, and Hari Trivedi. The EMory BrEast imaging dataset (EMBED): A racially diverse, granular dataset of 3.4 million screeni...

  27. [27]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  28. [28]

    Mimic-cxr database.PhysioNet10, 13026(C2JT1Q):5, 2024

    Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr database.PhysioNet10, 13026(C2JT1Q):5, 2024

  29. [29]

    Mimic-iv, a freely accessible electronic health record dataset.Scientific data, 10(1):1, 2023

    Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. Mimic-iv, a freely accessible electronic health record dataset.Scientific data, 10(1):1, 2023

  30. [30]

    The universal weight subspace hypothesis, 2025

    Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, and Alan Yuille. The universal weight subspace hypothesis, 2025

  31. [31]

    Eigenlorax: Recy- cling adapters to find principal subspaces for resource-efficient adaptation and inference

    Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, and Alan Yuille. Eigenlorax: Recy- cling adapters to find principal subspaces for resource-efficient adaptation and inference. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 649–659, 2025. 11

  32. [32]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  33. [33]

    Concentration inequalities and moment bounds for sample covariance operators.Bernoulli, 23(1):110–133, 2017

    Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators.Bernoulli, 23(1):110–133, 2017

  34. [34]

    Continual learning for domain adaptation in chest x-ray classification

    Matthias Lenga, Heinrich Schulz, and Axel Saalbach. Continual learning for domain adaptation in chest x-ray classification. InMedical Imaging with Deep Learning, pages 413–423. PMLR, 2020

  35. [35]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  36. [36]

    Theory on mixture-of- experts in continual learning.arXiv preprint arXiv:2406.16437, 2024

    Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, and Ness B Shroff. Theory on mixture-of- experts in continual learning.arXiv preprint arXiv:2406.16437, 2024

  37. [37]

    High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, and Ruslan Salakhutdinov. High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for high-modality representation learning. arXiv preprint arXiv:2203.01311, 2022

  38. [38]

    Inflora: Interference-free low-rank adaptation for continual learning

    Yan-Shuo Liang and Wu-Jun Li. Inflora: Interference-free low-rank adaptation for continual learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  39. [39]

    Med-flamingo: a multimodal medical few-shot learner (2023).URL: https://arxiv

    Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Cyril Zakka, Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, and Jure Leskovec. Med-flamingo: a multimodal medical few-shot learner (2023).URL: https://arxiv. org/abs/2307.15189, 2023

  40. [40]

    Multi- modal contrastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

    Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. Multi- modal contrastive learning with limoe: the language-image mixture of experts.Advances in Neural Information Processing Systems, 35:9564–9576, 2022

  41. [41]

    Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y Ng, et al. Multimodal deep learning. In ICML, volume 11, pages 689–696, 2011.

  42. [42]

    Huy Nguyen, Xing Han, Carl Harris, Suchi Saria, and Nhat Ho. On expert estimation in hierarchical mixture of experts: Beyond softmax gating functions. arXiv preprint arXiv:2410.02935, 2024.

  43. [43]

    Tom Pollard, Alistair Johnson, Jesse Raffa, Leo Anthony Celi, Omar Badawi, and Roger Mark. eICU Collaborative Research Database. PhysioNet, April 2019. Version 2.0.

  44. [44]

    Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ Digital Medicine, 3(1):119, 2020.

  45. [45]

    Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

  46. [46]

    Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

  47. [47]

    Grzegorz Rypeść, Sebastian Cygert, Valeriya Khan, Tomasz Trzciński, Bartosz Zieliński, and Bartłomiej Twardowski. Divide and not forget: Ensemble of selectively trained experts in continual learning. arXiv preprint arXiv:2401.10191, 2024.

  48. [48]

    Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021.

  49. [49]

    Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

  50. [50]

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.

  51. [51]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  52. [52]

    Luan Tran, Xiaoming Liu, Jiayu Zhou, and Rong Jin. Missing modalities imputation via cascaded residual autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1405–1414, 2017.

  53. [53]

    Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1–2):1–230, 2015.

  54. [54]

    Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021.

  55. [55]

    Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

  56. [56]

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.

  57. [57]

    Qi Wang, Liang Zhan, Paul Thompson, and Jiayu Zhou. Multimodal learning with incomplete modalities by knowledge distillation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1828–1838, 2020.

  58. [58]

    Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.

  59. [59]

    Michael W Weiner, Dallas P Veitch, Paul S Aisen, Laurel A Beckett, Nigel J Cairns, Robert C Green, Danielle Harvey, Clifford R Jack, Jr, William Jagust, John C Morris, Ronald C Petersen, Jennifer Salazar, Andrew J Saykin, Leslie M Shaw, Arthur W Toga, John Q Trojanowski, and Alzheimer’s Disease Neuroimaging Initiative. The Alzheimer’s Disease Neuroimaging Initiative 3: Continued innovation for clinical trial improvement. Alzheimers Dement, 13(5):561–571, December 2016.

  60. [60]

    Chenwei Wu, Zitao Shuai, Zhengxu Tang, Luning Wang, and Liyue Shen. Dynamic modeling of patients, modalities and tasks via multi-modal multi-task mixture of experts. In The Thirteenth International Conference on Learning Representations, 2025.

  61. [61]

    Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023.

  62. [62]

    Shu Yang, Muhammad Asif Ali, Cheng-Long Wang, Lijie Hu, and Di Wang. MoRAL: MoE augmented LoRA for LLMs’ lifelong learning. arXiv preprint arXiv:2402.11260, 2024.

  63. [63]

    Jun Yu, Yutong Dai, Xiaokang Liu, Jin Huang, Yishan Shen, Ke Zhang, Rong Zhou, Eashan Adhikarla, Wenxuan Ye, Yixin Liu, et al. Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras. arXiv preprint arXiv:2404.18961, 2024.

  64. [64]

    Yi Yu, Tengyao Wang, and Richard J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.

  65. [65]

    Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-MoE: Modeling arbitrary modality combination via the flexible mixture-of-experts. Advances in Neural Information Processing Systems, 37:98782–98805, 2024.

  66. [66]

    Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 34(12):5586–5609, 2021.

Supplementary Material for “FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning”

Appendix Contents

  • A. Extended Related Works
  • B. Proof of Proposition 1
  • C. Dataset Details
  • D. Additional Experime...

    MoRAL [62] couples MoE with LoRA, using low-rank adapters as experts so that lifelong adaptation of LLMs can be achieved with minimal trainable parameters. SEED [47] tackles class-incremental learning by maintaining a fixed-size expert ensemble and selectively fine-tuning only the single expert whose distributions overlap least with the new task. Despite these advan...
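    The MoE-with-LoRA-experts idea above can be sketched as follows. This is an illustrative toy, not the implementation of MoRAL [62] or FLAME: a frozen base projection plus several low-rank adapter "experts" mixed by a per-token softmax gate. All names (`LoRAExpertLayer`, `rank`, `n_experts`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRAExpertLayer:
    """Frozen linear layer whose experts are LoRA pairs, mixed by a softmax gate."""

    def __init__(self, d_in, d_out, n_experts=4, rank=2):
        self.W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen base weight
        # Each expert e contributes a low-rank update A_e B_e; only these
        # (and the gate) would be trainable during lifelong adaptation.
        self.A = rng.standard_normal((n_experts, d_in, rank)) * 0.02
        self.B = np.zeros((n_experts, rank, d_out))          # zero-init => no-op at start
        self.gate = rng.standard_normal((d_in, n_experts)) * 0.02

    def __call__(self, x):
        # Per-token softmax routing over experts.
        logits = x @ self.gate
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        base = x @ self.W
        # Per-token, per-expert low-rank update: x A_e B_e.
        updates = np.einsum('td,edr,ero->teo', x, self.A, self.B)
        # Gate-weighted mixture of expert updates added to the frozen base.
        return base + np.einsum('te,teo->to', probs, updates)

layer = LoRAExpertLayer(d_in=8, d_out=8)
x = rng.standard_normal((5, 8))  # 5 tokens
y = layer(x)
# With B zero-initialized, the layer reduces to the frozen base projection.
assert np.allclose(y, x @ layer.W)
```

The zero-initialized `B` matrices mirror standard LoRA practice: new experts start as identity-preserving no-ops, so adding adapters for a new task cannot perturb previously learned behavior before training begins.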