pith. sign in

arxiv: 2606.00382 · v1 · pith:57I6TZ7Ynew · submitted 2026-05-29 · 💻 cs.LG

CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs

Pith reviewed 2026-06-28 23:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learninglarge language modelsresidual adaptersspectral boundsbackward transfermodular fine-tuningSinkhorn normalizationcatastrophic forgetting
0
0 comments X

The pith

CRMA keeps the shared weights of large language models trainable across sequential tasks by enforcing a spectral bound on its mixing matrix at every step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a residual adapter called CRMA whose internal mixing matrix is forced to be doubly stochastic via Sinkhorn normalization at each forward pass. This construction, by Birkhoff's theorem, guarantees that the spectral norm of the matrix stays at most 1, giving a structural bound rather than a learned penalty. The resulting backbone can therefore continue to update during fine-tuning on new domains while still supporting modular per-task adapters. Experiments on Mistral-7B and other models across multiple sequential domains show that loss drift on prior tasks drops from over 40 percent to near zero and that holdout performance on old tasks actually improves. The method requires no replay buffers, no growing memory, and no distillation.

Core claim

CRMA supplies a continuously trainable shared substrate for modular continual fine-tuning by making its residual mixing matrix doubly stochastic at every forward pass, which by Birkhoff's theorem enforces a spectral-norm bound of 1 and thereby prevents interference between tasks while still allowing positive backward transfer.

What carries the argument

The CRMA residual adapter, whose mixing matrix M is kept doubly stochastic at every forward pass by Sinkhorn normalization so that Birkhoff's theorem directly supplies the bound ||M||_2 <= 1.

If this is right

  • Modular per-task LoRA adapters can be trained on top of a backbone that itself keeps updating without producing catastrophic forgetting.
  • Prior-task holdout loss improves rather than degrades, yielding positive backward transfer without any replay or distillation.
  • The same forgetting-prevention effect appears across model sizes from 1.1 B to 9.2 B parameters and across four architecture families.
  • No per-task memory growth or replay buffer is required to maintain performance on earlier domains.
  • Inference-time toggling of the CRMA component alone restores access to sequentially acquired knowledge on the same weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral-bound construction could be tested on vision or multimodal backbones to check whether the forgetting-prevention pattern generalizes beyond language models.
  • If the bound remains tight at 100-plus tasks, the approach would remove the usual scaling barrier that forces practitioners to choose between replay and freezing.
  • An open question left by the work is whether the Sinkhorn step can be replaced by a cheaper projection while still guaranteeing the same norm bound.

Load-bearing premise

That Sinkhorn normalization will continue to keep the spectral norm of the mixing matrix at or below 1 across long sequences of tasks and larger models without the bound loosening or other unstated factors producing the observed gains.

What would settle it

An experiment on a sixth sequential domain in which the measured spectral norm of M exceeds 1.0 or prior-task holdout loss rises by more than 1 percent relative to the CRMA baseline.

Figures

Figures reproduced from arXiv: 2606.00382 by Ashwin Shanmugasundaram, Aswini Nutakki, Kiran Nayudu, Sai Vinay Naidu.

Figure 1
Figure 1. Figure 1: End-to-end CRMA training pipeline. Training data passes through the dataset optimizer, [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CRMA as a block. The adapter is a learned non-expansive residual module placed [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Spectral norm across 867 training steps spanning 5 sequential domains (Gemma-2-9B). [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-domain retention across sequential phases on all 5 models. Each row is a training [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Multi-model continual learning comparison: CRMA vs. NAIVE sequential fine-tuning [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Standard benchmark scores for Gemma-2-9B before and after 5-phase sequential CRMA [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-domain holdout NLL loss across 4 sequential training phases on Mistral-7B. Each [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-task forgetting on the v8.1 controlled ablation (Mistral-7B, 4 sequential domains). [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: 3-seed seed-robustness check on Mistral-7B, 5 sequential continual-learning domains. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Inference-time ablation on Gemma-2-9B across all five sequential domains. Same model [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
read the original abstract

Sequential fine-tuning of large language models forces a choice: let the shared substrate keep learning and accept catastrophic forgetting, or freeze it after task one and foreclose cross-task refinement. Per-task adapter methods (LoRAHub, AdapterFusion, PackNet, Progressive Networks) take the second path. We introduce CRMA (Constrained Residual Mixing Adapter), a residual adapter whose internal mixing matrix M is doubly-stochastic at every forward pass via Sinkhorn normalization, so by Birkhoff's theorem ||M||_2 <= 1 holds by construction -- a structural bound, not a penalty. CRMA's spectrally bounded backbone provides a continuously trained shared substrate that earlier modular methods could not, while preserving their forgetting guarantees. On Mistral-7B across 5 sequential domains and 3 seeds, modular per-task LoRA on a CRMA backbone reduces loss-relative drift from +42.96% +/- 5.5 (naive sequential fine-tuning) to -0.17% +/- 0.17, with disjoint per-seed ranges, and improves prior-task holdout loss by 1.99% +/- 0.54 over a matched frozen-substrate baseline. Three independent experimental setups (Mistral-7B 4-domain controlled ablation, TinyLlama 3-domain contamination-controlled replication, Mistral-7B cross-domain probes at 7B) all show positive backward transfer -- without replay buffers, without growing per-task memory, and without distillation. An inference-time ablation on Gemma-2-9B confirms CRMA mediates access to sequentially trained knowledge: 98/100 vs. 38/100 on the same weights and same questions with only CRMA injection toggled. 867 logged training steps verify ||M||_2 = 1.0 within float32 precision (max deviation 1.2 x 10^-7). The forgetting-prevention effect holds across 1.1B-9.2B parameters and four architecture families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces CRMA (Constrained Residual Mixing Adapter), a residual adapter whose mixing matrix M is rendered doubly-stochastic at every forward pass via Sinkhorn normalization. By Birkhoff's theorem this enforces ||M||_2 ≤ 1 by construction, supplying a structural spectral bound on the shared substrate. The approach permits continuous training of a common backbone while attaching per-task LoRA modules, thereby avoiding catastrophic forgetting without replay buffers or growing memory. Experiments on Mistral-7B (5 domains, 3 seeds), TinyLlama (3-domain replication), and Gemma-2-9B (inference toggle) report loss-relative drift reduced from +42.96% ± 5.5 to -0.17% ± 0.17, positive backward transfer of 1.99% ± 0.54 on prior-task holdouts, and an inference ablation showing 98/100 vs. 38/100 success when CRMA is toggled. A 867-step numerical check confirms the bound holds to within 1.2 × 10^{-7} in float32.

Significance. If the experimental controls hold, the work supplies both a parameter-free spectral guarantee and reproducible evidence of positive backward transfer across model scales (1.1B–9.2B) and four architecture families. The explicit 867-step verification of the norm bound and the contamination-controlled TinyLlama replication are concrete strengths that distinguish the contribution from penalty-based or post-hoc regularization approaches.

minor comments (3)
  1. [Experimental results] The abstract states that the per-seed drift ranges are disjoint; the main experimental section should tabulate the three individual seed values (or explicit ranges) for both CRMA and the naive baseline so readers can verify the separation directly.
  2. [Baselines] The description of the 'matched frozen-substrate baseline' should explicitly state whether the baseline receives the same total number of gradient steps on the shared weights as the CRMA runs, or whether the comparison is only at equal adapter capacity.
  3. Notation for the spectral-norm deviation (1.2 x 10^-7) should be rendered consistently in scientific notation throughout the text and any supplementary verification logs.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary, positive assessment of the spectral guarantee and experimental controls, and the recommendation for minor revision. We appreciate the recognition of the 867-step verification, contamination-controlled replication, and cross-scale results.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central derivation applies the external Birkhoff theorem to the output of Sinkhorn normalization on M to obtain the spectral bound ||M||_2 <=1 by construction; this is a direct invocation of a known result, not a self-referential definition or fitted input. Experimental outcomes (drift reduction, backward transfer) are reported from separate controlled ablations and replications rather than being predicted from the bound. No self-citation chains, ansatzes smuggled via prior work, or renamings of known results appear in the load-bearing steps. The derivation chain is therefore self-contained against external mathematical facts and empirical verification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the ledger is populated only with items explicitly named in the abstract. Birkhoff's theorem is invoked to justify the bound. No free parameters or invented entities are stated.

axioms (1)
  • standard math Birkhoff's theorem implies that any doubly-stochastic matrix satisfies ||M||_2 <= 1
    Invoked in the abstract to establish the structural bound on the mixing matrix M after Sinkhorn normalization.

pith-pipeline@v0.9.1-grok · 5920 in / 1500 out tokens · 20162 ms · 2026-06-28T23:06:18.091032+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    French, R. M. (1999). Catastrophic forgetting in connectionist networks.Trends in Cognitive Sciences, 3(4):128–135

  2. [2]

    A., et al

    Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., et al. (2017). Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):3521–3526

  3. [3]

    Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. InInternational Conference on Learning Representations (ICLR)

  4. [4]

    Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., et al. (2019). Parameter-efficient transfer learning for NLP. InInternational Conference on Machine Learning (ICML)

  5. [5]

    Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. InAssociation for Computational Linguistics (ACL)

  6. [6]

    Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter- efficient prompt tuning. InConference on Empirical Methods in Natural Language Processing (EMNLP)

  7. [7]

    J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations (ICLR)

  8. [8]

    Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). QLoRA: Efficient fine- tuning of quantized LLMs. InAdvances in Neural Information Processing Systems (NeurIPS). 32

  9. [9]

    Frascaroli, E., Benaglia, R., Boschini, M., Moschella, L., Fiorini, C., Rodola, E., and Calderara, S. (2023). Latent spectral regularization for continual learning.arXiv preprint arXiv:2301.03345

  10. [10]

    Wang, X., Zhang, Y., Chen, T., Gao, S., Jin, S., Yang, X., et al. (2023). TRACE: A comprehensive benchmark for continual learning in large language models.arXiv preprint arXiv:2310.06762

  11. [11]

    Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NeurIPS)

  12. [12]

    E., Ablin, P., Blondel, M., and Peyré, G

    Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. (2022). Sinkformers: Transformers with doubly stochastic attention. InInternational Conference on Artificial Intelligence and Statistics (AISTATS)

  13. [13]

    Tay, Y., Bahri, D., Yang, L., Metzler, D., and Juan, D.-C. (2020). Sparse Sinkhorn attention. InInternational Conference on Machine Learning (ICML)

  14. [14]

    Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices.Annals of Mathematical Statistics, 35(2):876–879

  15. [15]

    Horn, R. A. and Johnson, C. R. (1990).Matrix Analysis. Cambridge University Press

  16. [16]

    Meng, F., Wang, Z., and Zhang, M. (2024). PiSSA: Principal singular values and singular vectors adaptation of large language models. InAdvances in Neural Information Processing Systems (NeurIPS), Spotlight

  17. [17]

    Meng, F. (2026). Beyond retention: Orchestrating structural safety and plasticity in continual learning for LLMs.arXiv preprint arXiv:2601.18255

  18. [18]

    Hayou, S., Ghosh, N., and Yu, B. (2024). LoRA+: Efficient low rank adaptation of large models. InInternational Conference on Machine Learning (ICML)

  19. [19]

    F., Lan, Q., Rahman, P., Mahmood, A

    Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. (2024). Loss of plasticity in deep continual learning.Nature, 632, 768–774

  20. [20]

    Lewandowski, A., Bortkiewicz, M., Kumar, S., Gyorgy, A., Schuurmans, D., Ostaszewski, M., and Machado, M. C. (2025). Learning continually by spectral regularization. InInternational Conference on Learning Representations (ICLR)

  21. [21]

    and Pilanci, M

    Zhang, F. and Pilanci, M. (2024). Spectral adapter: Fine-tuning in spectral space. InAdvances in Neural Information Processing Systems (NeurIPS)

  22. [22]

    and Xie, X

    Xiong, Y. and Xie, X. (2025). OPLoRA: Orthogonal projection LoRA prevents catastrophic forgetting during parameter-efficient fine-tuning.arXiv preprint arXiv:2510.13003

  23. [23]

    Food and Drug Administration (2024)

    U.S. Food and Drug Administration (2024). Marketing submission recommendations for a predeterminedchangecontrolplanforartificialintelligence/machinelearning(AI/ML)-enabled device software functions. FDA Guidance Document

  24. [24]

    Birkhoff, G. (1946). Three observations on linear algebra.Univ. Nac. Tucumán Rev. Ser. A, 5, 147–151. 33

  25. [25]

    Mena, G., Belanger, D., Linderman, S., and Snoek, J. (2018). Learning latent permutations with Gumbel-Sinkhorn networks. InInternational Conference on Learning Representations (ICLR)

  26. [26]

    Progressive Neural Networks

    Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016). Progressive neural networks.arXiv preprint arXiv:1606.04671

  27. [27]

    and Lazebnik, S

    Mallya, A. and Lazebnik, S. (2018). PackNet: Adding multiple tasks to a single network by iterative pruning. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  28. [28]

    Overcomingcatastrophicforgetting with hard attention to the task

    Serra, J., Surís, D., Miron, M., andKaratzoglou, A.(2018). Overcomingcatastrophicforgetting with hard attention to the task. InInternational Conference on Machine Learning (ICML)

  29. [29]

    H., and Gao, J

    Wang, Y., Agarwal, V., Mukherjee, S., Liu, X., Gao, J., Awadallah, A. H., and Gao, J. (2024). O-LoRA: Orthogonal low-rank adaptation for large language model based continual learning. InInternational Conference on Learning Representations (ICLR)

  30. [30]

    and Li, Z

    Liang, H. and Li, Z. (2024). InfLoRA: Interference-free low-rank adaptation for continual learning. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  31. [31]

    Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. (2021). Regularisation of neural networks by enforcing Lipschitz continuity.Machine Learning, 110(2):393–416

  32. [32]

    Y., Pang, T., Du, C., and Lin, M

    Huang, C., Liu, Q., Lin, B. Y., Pang, T., Du, C., and Lin, M. (2024). LoRAHub: Efficient cross-task generalization via dynamic LoRA composition. InConference on Language Modeling (COLM)

  33. [33]

    and Hoiem, D

    Li, Z. and Hoiem, D. (2018). Learning without forgetting.IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947

  34. [34]

    and Ranzato, M

    Lopez-Paz, D. and Ranzato, M. (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NeurIPS)

  35. [35]

    van de Ven, G. M. and Tolias, A. S. (2019). Three scenarios for continual learning.arXiv preprint arXiv:1904.07734

  36. [36]

    Biderman, D., Ortiz, J. J. G., Portes, J., Paul, M., Greengard, P., Jennings, C., et al. (2024). LoRA learns less and forgets less.Transactions on Machine Learning Research (TMLR)

  37. [37]

    Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., and Gurevych, I. (2021). AdapterFusion: Non- destructive task composition for transfer learning. InConference of the European Chapter of the Association for Computational Linguistics (EACL)

  38. [38]

    Buehler, E. L. and Buehler, M. J. (2024). X-LoRA: Mixture of low-rank adapter experts, a flexible framework for large language models.arXiv preprint arXiv:2402.07148

  39. [39]

    Dou, S., Zhou, E., Liu, Y., Gao, S., Zhao, J., Shen, W., et al. (2024). LoRAMoE: Alleviating world knowledge forgetting in large language models via MoE-style plugin. InAssociation for Computational Linguistics (ACL)

  40. [40]

    Saha, G., Garg, I., and Roy, K. (2021). Gradient projection memory for continual learning. In International Conference on Learning Representations (ICLR). 34

  41. [41]

    Wang, S., Li, X., Sun, J., and Xu, Z. (2021). Training networks in null space of feature covariance for continual learning. InComputer Vision and Pattern Recognition (CVPR)

  42. [42]

    Nayudu, K., Nutakki, A., Naidu, S.V., and Shanmugasundaram, A. (2026a). CRMA com- ponent isolation: Architectural ablation for modular continual fine-tuning.ModelBrew AI Technical Report(in preparation)

  43. [43]

    Wu, X., Huang, S., and Wei, F. (2024). Mixture of LoRA Experts.arXiv preprint arXiv:2404.13628

  44. [44]

    Zhao, W., Wang, S., Hu, Y., Zhao, Y., Qin, B., Zhang, X., Yang, Q., Xu, D., and Che, W. (2024). SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)

  45. [45]

    Wang, L., Zhang, X., Su, H., and Zhu, J. (2024). A Comprehensive Survey of Continual Learning: Theory, Method and Application.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

  46. [46]

    Yadav, P., Tam, D., Choshen, L., Raffel, C., and Bansal, M. (2023). TIES-Merging: Resolving Interference When Merging Models. InAdvances in Neural Information Processing Systems (NeurIPS)

  47. [47]

    Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y. (2024). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch. InInternational Conference on Machine Learning (ICML)

  48. [48]

    Yang, E., Wang, Z., Shen, L., Liu, S., Guo, G., Wang, X., and Tao, D. (2024). AdaMerging: Adaptive Model Merging for Multi-Task Learning. InInternational Conference on Learning Representations (ICLR). A Subspace Overlap as a Routing Signal: A Within-Team Exploratory Study Motivation.Section 5.1 flags same-vertical routing as a load-bearing untested case f...

  49. [49]

    Single seed per cell. Held-out scoring uses deterministic generation against reference answers (Jaccard threshold≥0.3at 1.1B,≥0.25at 7B).Methodological caveat:the per-scale Jaccard threshold differs by scale, which is a researcher degree of freedom that mechanically inflates the absolute retention numbers at 7B relative to 1.1B. We were not able to re-sco...

  50. [50]

    RAG dominates fine-tuning on personal- knowledge text

    (in preparation) and are available on request. Pair (dogs↔X) LoRA-side overlap CRMA-side overlap Cars (uncorrelated)0.339±0.092 0.384±0.138 Dogs Advice (medium)0.366±0.100 0.464±0.221 Cats (correlated)0.461±0.087 0.612±0.187 exported as a projector for the next, providing a measurable structural quantity (principal-angle overlap) that does not require a t...