pith. machine review for the scientific record.

arxiv: 2605.14364 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: no theorem link

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · modular representations · sequential data · representation learning · identifiability · hierarchical modules · time-delayed dependencies · catastrophic forgetting

The pith

MoRe decomposes sequential representations into identifiable hierarchies of fundamental and specific modules to support continual adaptation while preserving prior knowledge by construction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Continual learning fails when new data overwrites old representations in models trained on streams of information. The paper argues that time-delayed dependencies within sequential data already encode an intrinsic modular structure, where basic representations give rise to more specific ones. MoRe extracts this structure as a hierarchy of modules equipped with identifiability guarantees, so that new information can be incorporated by reusing, aligning, or expanding modules without touching earlier ones. This separation of concerns yields measurable gains in the stability-plasticity balance on both synthetic sequences and real LLM activations. A sympathetic reader would see the work as replacing task-boundary engineering with data-driven modularity for long-term adaptation.

Core claim

MoRe identifies modularity directly in the representation space by decomposing knowledge into a hierarchy of fundamental and specific modules that carry identifiability guarantees; this decomposition is recovered from time-delayed dependencies in the data, allowing new modules to be added or aligned during adaptation while old modules remain untouched by construction, thereby improving the plasticity-stability trade-off without reference to explicit task boundaries.

What carries the argument

MoRe's hierarchical module decomposition with identifiability guarantees, recovered from time-delayed dependencies to separate fundamental from specific representations.
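
A minimal sketch of what this would look like operationally, assuming a two-level hierarchy and PyTorch-style linear modules; the class and method names here (ModuleBank, freeze_existing, add_specific_module) are illustrative inventions, not the paper's code:

```python
# Illustrative sketch, not the authors' implementation: a two-level module bank
# where "fundamental" modules are frozen by construction and adaptation to a new
# stream only appends and trains a new "specific" module.
import torch
import torch.nn as nn

class ModuleBank(nn.Module):
    def __init__(self, input_dim: int, module_dim: int, n_fundamental: int):
        super().__init__()
        # Fundamental modules: low-level, shared representations.
        self.fundamental = nn.ModuleList(
            [nn.Linear(input_dim, module_dim) for _ in range(n_fundamental)]
        )
        # Specific modules: appended as new streams/domains arrive.
        self.specific = nn.ModuleList()
        self.module_dim = module_dim

    def freeze_existing(self) -> None:
        # "Preserved by construction": no gradients flow into existing modules.
        for p in self.parameters():
            p.requires_grad_(False)

    def add_specific_module(self, input_dim: int) -> nn.Module:
        # Expansion: the new module reads the input plus the fundamental output.
        new_module = nn.Linear(input_dim + self.module_dim, self.module_dim)
        self.specific.append(new_module)
        return new_module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = torch.stack([m(x) for m in self.fundamental]).mean(dim=0)
        if len(self.specific) == 0:
            return base
        outs = [m(torch.cat([x, base], dim=-1)) for m in self.specific]
        return torch.stack(outs).mean(dim=0)

bank = ModuleBank(input_dim=16, module_dim=8, n_fundamental=2)
bank.freeze_existing()                              # stability for old modules
new_mod = bank.add_specific_module(input_dim=16)    # plasticity localized here
opt = torch.optim.Adam(new_mod.parameters(), lr=1e-3)
x = torch.randn(32, 16)
loss = bank(x).pow(2).mean()                        # placeholder objective
loss.backward()
opt.step()
```

In this sketch the stability half of the trade-off is enforced mechanically by freezing; what MoRe claims on top of that is that the module boundaries themselves are identifiable from time-delayed structure rather than chosen by hand.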

If this is right

  • New tasks are handled by module expansion or alignment rather than full parameter updates, reducing interference with stored knowledge.
  • Representations acquire explicit hierarchical structure that can be inspected for which parts are reused across tasks.
  • The same module set can be carried forward across arbitrary numbers of sequential domains without explicit rehearsal buffers.
  • Plasticity is localized to new or aligned modules while stability is enforced on all prior modules.
  • The framework applies directly to internal activations of large models, as demonstrated on LLM hidden states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the modular decomposition generalizes beyond sequences, similar dependency signals might be engineered in static data to obtain reusable components for transfer learning.
  • Identifiability could support auditing which specific knowledge remains after long adaptation streams, useful for safety-critical continual systems.
  • The hierarchy might be combined with existing architectural modularity techniques to reduce the number of parameters that must be stored for each new domain.
  • Testing on longer, more heterogeneous streams would reveal whether the fundamental modules remain stable across distribution shifts larger than those in the reported benchmarks.

Load-bearing premise

Time-delayed dependencies in sequential data naturally expose an intrinsic modular organization that can be recovered independently of any task labels or boundaries.
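
A toy illustration of the kind of signal this premise relies on, using plain lag-1 correlation as a stand-in for whatever dependence measure the paper actually employs; the AR processes, the 0.3 threshold, and the one-pass grouping rule are arbitrary choices made for the example:

```python
# Minimal sketch: group latent dimensions into candidate modules by their
# time-delayed (lag-1) statistical dependence. Not the paper's estimator.
import numpy as np

rng = np.random.default_rng(0)
T, d = 2000, 6

# Synthetic latents: dims 0-2 share one AR driver, dims 3-5 share another,
# so the true "modules" are {0, 1, 2} and {3, 4, 5}.
z = np.zeros((T, d))
for t in range(1, T):
    z[t, :3] = 0.9 * z[t - 1, :3].mean() + rng.standard_normal(3)
    z[t, 3:] = 0.9 * z[t - 1, 3:].mean() + rng.standard_normal(3)

# Lag-1 cross-correlation |corr(z_{t-1,i}, z_{t,j})| between all dimension pairs.
past, future = z[:-1], z[1:]
C = np.abs(np.corrcoef(past.T, future.T)[:d, d:])

# Threshold the symmetrized dependence matrix and read off groups of
# mutually time-dependent dimensions.
adj = (C + C.T) / 2 > 0.3
modules, unassigned = [], set(range(d))
while unassigned:
    i = unassigned.pop()
    group = {i} | {j for j in unassigned if adj[i, j]}
    unassigned -= group
    modules.append(sorted(group))
print(modules)  # expected: [[0, 1, 2], [3, 4, 5]]
```

The premise is that groupings like this, recovered purely from temporal statistics, coincide with a reusable modular decomposition; the identifiability claim is what would elevate such a heuristic into a guarantee.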

What would settle it

A recovery experiment in which the learned modules, after adaptation to new sequences, fail to reconstruct performance on held-out earlier sequences at levels comparable to isolated training, or in which the claimed identifiability cannot be verified by re-identifying the same modules from fresh data draws.
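
A schematic of those two checks, with placeholder accuracy numbers and synthetic module outputs standing in for what a real experiment would produce; the forgetting metric and the permutation-matched correlation score are common choices, not taken from the paper:

```python
# Illustrative sketch of the two checks described above.
import numpy as np
from scipy.optimize import linear_sum_assignment

# (i) Forgetting: performance on held-out earlier sequences after adaptation,
# compared with a model trained on those sequences in isolation.
acc_isolated = np.array([0.92, 0.88, 0.90])     # placeholder per-sequence scores
acc_after_adapt = np.array([0.91, 0.87, 0.86])  # placeholder post-adaptation scores
forgetting = acc_isolated - acc_after_adapt
print("mean forgetting:", forgetting.mean())

# (ii) Re-identification: modules recovered from two independent data draws
# should agree up to permutation; score the best one-to-one matching of |corr|.
rng = np.random.default_rng(1)
z_run1 = rng.standard_normal((1000, 4))                                   # module outputs, draw 1
z_run2 = z_run1[:, [2, 0, 3, 1]] + 0.1 * rng.standard_normal((1000, 4))   # permuted, noisy draw 2

corr = np.abs(np.corrcoef(z_run1.T, z_run2.T)[:4, 4:])
row, col = linear_sum_assignment(-corr)          # maximize matched correlation
print("mean matched |corr|:", corr[row, col].mean())  # near 1 => re-identified
```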

Figures

Figures reproduced from arXiv: 2605.14364 by Boyang Sun, Jiaqi Sun, Kun Zhang, Mohamad Rasmy, Xiangchen Song.

Figure 1: Sequential data with modular representa… [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2: The scatter plot of estimated latents with true for… [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]
Figure 3: Plasticity, stability compared with baselines and gate decision accuracy [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]
Figure 4: Per-layer concept concentration across three datasets [PITH_FULL_IMAGE:figures/full_fig_p009_4.png]
Figure 5: Two-layer linear non-Gaussian synthetic experiment. Each panel plots the learned scalar… [PITH_FULL_IMAGE:figures/full_fig_p021_5.png]
Figure 6: Two-layer nonlinear synthetic experiment. The learned representation remains strongly… [PITH_FULL_IMAGE:figures/full_fig_p022_6.png]
Figure 7: Three-layer synthetic experiment, first-layer focus. The learned scalar representation is… [PITH_FULL_IMAGE:figures/full_fig_p022_7.png]
Figure 8: Three-layer synthetic experiment, full cross-layer comparison. Each panel compares… [PITH_FULL_IMAGE:figures/full_fig_p023_8.png]
Original abstract

Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting existing representations, and this structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MoRe, a framework for continual representation learning on sequential data that decomposes knowledge into a hierarchy of fundamental and specific modules identified from time-delayed dependencies. It claims identifiability guarantees that enable principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction, in contrast to task-specific parameter or architecture modifications. Experiments on synthetic benchmarks and real-world LLM activations are reported to demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs.

Significance. If the identifiability guarantees hold under clearly stated statistical assumptions, MoRe would provide a representation-centric foundation for continual learning that leverages intrinsic sequential structure rather than external task boundaries, potentially yielding more stable and interpretable adaptation. The validation on LLM activations adds practical relevance, and the modular hierarchy concept aligns with cognitive inspirations in a way that could influence future work on unsupervised continual representation learning.

major comments (2)
  1. [Abstract] The central claim of 'identifiability guarantees' for the hierarchy of fundamental and specific modules is asserted without any derivation, statement of statistical assumptions (e.g., independence, non-Gaussianity, or delay structure), or proof that time-delayed dependencies induce a unique decomposition. This is load-bearing for the assertions of principled reuse, alignment, expansion, and preservation 'by construction'.
  2. [Method (inferred from abstract description)] The mapping from time-delayed correlations to an identifiable modular hierarchy is presented as independent of task boundaries, yet no conditions are supplied showing that this mapping is one-to-one rather than many-to-one; without such conditions the guarantees of unique recovery and non-interference cannot be verified.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by a concise statement of the concrete objective or loss used to recover the modules from delayed dependencies.
  2. Notation for 'fundamental' versus 'specific' modules should be introduced with explicit definitions or equations when first used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the identifiability claims central to MoRe. We address each major comment below and will revise the manuscript accordingly to improve clarity on assumptions and proofs.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'identifiability guarantees' for the hierarchy of fundamental and specific modules is asserted without any derivation, statement of statistical assumptions (e.g., independence, non-Gaussianity, or delay structure), or proof that time-delayed dependencies induce a unique decomposition. This is load-bearing for the assertions of principled reuse, alignment, expansion, and preservation 'by construction'.

    Authors: We agree the abstract is too concise on this point. The full manuscript (Section 3.2 and Theorem 1) derives identifiability from independent non-Gaussian latent factors whose time-delayed correlations induce a unique hierarchical decomposition. We will revise the abstract to state the key assumptions explicitly (e.g., 'under the assumptions of independent non-Gaussian sources and time-delayed dependencies') and add a parenthetical reference to the theorem, thereby grounding the claims of reuse, alignment, expansion, and preservation by construction. revision: yes

  2. Referee: [Method (inferred from abstract description)] The mapping from time-delayed correlations to an identifiable modular hierarchy is presented as independent of task boundaries, yet no conditions are supplied showing that this mapping is one-to-one rather than many-to-one; without such conditions the guarantees of unique recovery and non-interference cannot be verified.

    Authors: Theorem 1 in Section 3.2 proves the mapping is one-to-one under the stated conditions of source independence, non-Gaussianity, and the specific delay structure; the decomposition is recovered uniquely from the data's intrinsic time-delayed statistics and does not rely on task boundaries. This directly yields unique recovery and non-interference, as fundamental modules remain fixed while new specific modules are appended. We will add an explicit statement of these conditions at the start of the method section and include a brief proof outline for verification. revision: yes
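
For orientation, the family of identifiability results the rebuttal appeals to is usually stated for a data-generating process of roughly the following form; this is a generic sketch of the assumed setting, not a quotation of the paper's Theorem 1, and the notation (g, f_i, Pa, epsilon) is standard rather than the paper's:

```latex
% Generic time-delayed latent-variable model (illustrative assumption, not Theorem 1):
% observations are an invertible mixture of latents, and each latent depends on
% time-delayed parents through its own mutually independent, non-Gaussian noise.
\begin{align*}
  \mathbf{x}_t &= g(\mathbf{z}_t), \qquad g \ \text{invertible (or injective)}, \\
  z_{t,i} &= f_i\big(\mathrm{Pa}(z_{t,i}) \subseteq \mathbf{z}_{t-1}, \ \epsilon_{t,i}\big), \\
  \epsilon_{t,i} &\ \text{mutually independent across } i \ \text{and non-Gaussian.}
\end{align*}
```

Results in this family typically conclude that any encoder matching the observed time-delayed dependence structure recovers the latents up to permutation and element-wise transformation; whether a module-level hierarchy with frozen fundamental parts follows from such a statement is precisely what the referee asks the authors to spell out.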

Circularity Check

1 step flagged

Identifiability guarantees asserted without derivation or external grounding

specific steps
  1. other [Abstract]
    "MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction."

    The identifiability is invoked to justify 'principled' and 'by construction' properties, yet the text provides no derivation, no statement of the conditions that would make the decomposition unique, and no link to external benchmarks or proofs. The guarantee therefore functions as an unverified premise rather than an independently established result.

full rationale

The central claim rests on 'identifiability guarantees' for the modular hierarchy derived from time-delayed dependencies. The abstract states this property enables reuse/alignment/expansion 'by construction' but supplies no equations, statistical assumptions (e.g., independence or delay structure), or proof of uniqueness. No self-citation chain or fitted-parameter reduction is visible in the provided text; the guarantee functions as an imported premise rather than a derived result. This produces moderate circularity risk because the load-bearing uniqueness is not shown to be independent of the target claims, yet the paper remains self-contained on other axes and does not reduce the entire framework to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that sequential data contains recoverable modular structure via time delays; no free parameters or new entities with independent evidence are specified in the abstract.

axioms (1)
  • domain assumption: Time-delayed dependencies in sequential data provide a natural signal for uncovering intrinsic modular organization in representations.
    Invoked as the basis for identifying modules rather than imposing task boundaries.
invented entities (1)
  • Hierarchy of fundamental and specific modules (no independent evidence)
    purpose: To decompose knowledge for reuse, alignment, and expansion while preserving old modules by construction.
    New postulated structure introduced by the framework.

pith-pipeline@v0.9.0 · 5515 in / 1184 out tokens · 40254 ms · 2026-05-15T02:52:29.009043+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

  1. [1]

    R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.

  2. [2]

    R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3366–3375, 2017.

  3. [3]

    S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023.

  4. [4]

    W. Chen, Y. Zhou, N. Du, Y. Huang, J. Laudon, Z. Chen, and C. Cui. Lifelong language pretraining with distribution-specialized experts. In International Conference on Machine Learning, pages 5383–5395. PMLR, 2023.

  5. [5]

    PathNet: Evolution Channels Gradient Descent in Super Neural Networks

    C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

  6. [6]

    E. Fini, V. G. T. Da Costa, X. Alameda-Pineda, E. Ricci, K. Alahari, and J. Mairal. Self-supervised models are continual learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9621–9630, 2022.

  7. [7]

    A. Gomez-Villa, B. Twardowski, L. Yu, A. D. Bagdanov, and J. Van de Weijer. Continually learning self-supervised representations with projected functional regularization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3867–3877, 2022.

  8. [8]

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP. In International conference on machine learning, pages 2790–2799. PMLR, 2019.

  9. [9]

    D. Hu, S. Yan, Q. Lu, L. Hong, H. Hu, Y. Zhang, Z. Li, X. Wang, and J. Feng. How well does self-supervised pre-training perform with streaming data? arXiv preprint arXiv:2104.12081, 2021.

  10. [10]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  11. [11]

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

  12. [12]

    H. Li, S. Lin, L. Duan, Y. Liang, and N. B. Shroff. Theory on mixture-of-experts in continual learning. arXiv preprint arXiv:2406.16437, 2024.

  13. [13]

    Z. Li and D. Hoiem. Learning without forgetting. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 614–629, Cham, 2016. Springer International Publishing.

  14. [14]

    Z. Li, Y. Shen, K. Zheng, R. Cai, X. Song, M. Gong, G. Chen, and K. Zhang. On the identification of temporal causal representation with instantaneous dependence. In The Thirteenth International Conference on Learning Representations, 2025.

  15. [15]

    Y.-S. Liang and W.-J. Li. InfLoRA: Interference-free low-rank adaptation for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23638–23647, 2024.

  16. [16]

    W. Liu, F. Zhu, and C.-L. Liu. Branch-tuning: Balancing stability and plasticity for continual self-supervised learning. IEEE Transactions on Neural Networks and Learning Systems, 2025.

  17. [17]

    Learning Sparse Neural Networks through $L_0$ Regularization

    C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through L_0 regularization. arXiv preprint arXiv:1712.01312, 2017.

  18. [18]

    D. Madaan, J. Yoon, Y. Li, Y. Liu, and S. J. Hwang. Representational continuity for unsupervised continual learning. arXiv preprint arXiv:2110.06976, 2021.

  19. [19]

    A. Mallya and S. Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.

  20. [20]

    J. Pfeiffer, S. Ruder, I. Vulić, and E. M. Ponti. Modular deep learning. arXiv preprint arXiv:2302.11529, 2023.

  21. [21]

    E. M. Ponti, A. Sordoni, Y. Bengio, and S. Reddy. Combining parameter-efficient modules for task-level generalisation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 687–702, 2023.

  22. [22]

    D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell. Continual unsupervised representation learning. Advances in neural information processing systems, 32, 2019.

  23. [23]

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.

  24. [24]

    A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

  25. [25]

    J. Serra, D. Suris, M. Miron, and A. Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pages 4548–4557. PMLR, 2018.

  26. [26]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

  27. [27]

    J. S. Smith, L. Karlinsky, V. Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira. CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11909–11919, 2023.

  28. [28]

    X. Song, J. Sun, Z. Li, Y. Zheng, and K. Zhang. LLM interpretability with identifiable temporal-instantaneous representation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  29. [29]

    S. Tafazoli, F. M. Bouchacourt, A. Ardalan, N. T. Markov, M. Uchimura, M. G. Mattar, N. D. Daw, and T. J. Buschman. Building compositional tasks with shared neural subspaces. Nature, 650(8100):164–172, 2026.

  30. [30]

    C. I. Tang, L. Qendro, D. Spathis, F. Kawsar, C. Mascolo, and A. Mathur. Kaizen: Practical self-supervised continual learning with continual fine-tuning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2841–2850, 2024.

  31. [31]

    G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.

  32. [32]

    K. Tian, Z. Zhao, Y. Chen, N. Ge, S. Cao, X. Han, J. Gu, and S. Yu. Domain-specific schema reuse supports flexible learning to learn in the primate brain. Nature Communications, 2026.

  33. [33]

    T. Veniat, L. Denoyer, and M. Ranzato. Efficient continual learning with modular networks and task-driven priors, 2021.

  34. [34]

    L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE transactions on pattern analysis and machine intelligence, 46(8):5362–5383, 2024.

  35. [35]

    X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X.-J. Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, 2023.

  36. [36]

    Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In European conference on computer vision, pages 631–648. Springer, 2022.

  37. [37]

    Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 139–149, 2022.

  38. [38]

    W. Yao, G. Chen, and K. Zhang. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492–26503, 2022.

  39. [39]

    J. Yu, Y. Zhuge, L. Zhang, P. Hu, D. Wang, H. Lu, and Y. He. Boosting continual learning of vision-language models via mixture-of-experts adapters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23219–23230, 2024.

  40. [40]

    K. Zhang, S. Xie, I. Ng, and Y. Zheng. Causal representation learning from multiple distributions: A general setting. arXiv preprint arXiv:2402.05052, 2024.

  41. [41]

    Second, with the density estimators fixed, we update f_i using Eq. (25). In practice, we warm up the encoder with L_pred + L_rec before enabling the CMI penalty. This warm-up prevents early density-estimation noise from collapsing the representation. After warm-up, the prediction loss can be hinge-thresholded at the warm-up value so that the CMI term shapes ...

  42. [42]

    expands and freezes distribution-specialized experts and gating dimensions for continual language pre-training, while MoE-Adapters [39] attach task-specific adapter experts to a frozen vision-language backbone with a distribution-discriminative auto-selector. With large pre-trained models, parameter-efficient fine-tuning (PEFT) methods adapt only small t...

  43. [43]

    reparameterizes pre-trained weights through an interference-eliminating subspace. Prompt-based continual learning extends the PEFT idea by learning small prompt memories or complementary prompts to manage task-specific and task-invariant knowledge without replay [37, 36, 27]. Despite their empirical strengths, these supervised CL methods predominantly tre...