arxiv: 2411.02813 · v3 · pith:IHSPOL72new · submitted 2024-11-05 · 💻 cs.LG

Sparse Orthogonal Parameters Tuning for Continual Learning

Pith reviewed 2026-05-23 18:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords continual learning · sparse orthogonal parameters · pre-trained models · catastrophic forgetting · parameter tuning · delta parameters

The pith

Merging sparse orthogonal parameters from multiple tasks prevents catastrophic forgetting when adapting pre-trained models to streaming data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines continual learning where pre-trained models must handle successive tasks without losing prior knowledge. It introduces SoTU, which tunes sparse orthogonal parameters so that updates from different tasks can be combined without interference. The approach is tested on standard benchmarks and works without redesigning classifiers or other components. A reader would care because it simplifies adaptation to new data while keeping earlier performance intact through the orthogonality property.

Core claim

The paper claims that knowledge from multiple domains can be transformed into a fusion of orthogonal delta parameters, and that this fusion allows models to maintain feature representations across streaming tasks without catastrophic forgetting, yielding strong results on diverse continual learning benchmarks as a plug-and-play method.

What carries the argument

The fusion of orthogonal delta parameters obtained from sparse tuning on successive tasks.
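
To make the idea of fusing sparse delta parameters concrete, here is a minimal NumPy sketch under stated assumptions. The function names (`masked_delta`, `merge`), the summation used for merging, and the rescaling by 1 / (1 - mask_rate) are all illustrative choices, not details confirmed by the text above; the weights are synthetic vectors rather than a real ViT.

```python
import numpy as np

def masked_delta(theta_ft, theta_pre, mask_rate, rng, rescale=True):
    """Randomly sparsify the delta between fine-tuned and pre-trained weights."""
    delta = theta_ft - theta_pre
    # Keep each coordinate independently with probability 1 - mask_rate.
    keep = rng.random(delta.shape) >= mask_rate
    sparse = delta * keep
    if rescale and mask_rate < 1.0:
        # ASSUMPTION: rescale survivors so the expected delta is unchanged.
        sparse = sparse / (1.0 - mask_rate)
    return sparse

def merge(theta_pre, sparse_deltas):
    """ASSUMPTION: fuse tasks by adding all sparse deltas to the base weights."""
    return theta_pre + np.sum(sparse_deltas, axis=0)

rng = np.random.default_rng(0)
dim = 10_000
theta_pre = rng.normal(size=dim)  # stand-in for pre-trained weights
# Three synthetic "fine-tuned" models, one per task.
finetuned = [theta_pre + 0.01 * rng.normal(size=dim) for _ in range(3)]

deltas = [masked_delta(t, theta_pre, 0.9, rng) for t in finetuned]
theta_merged = merge(theta_pre, deltas)
kept_fraction = float(np.mean(deltas[0] != 0))  # close to 1 - mask_rate
```

With a 90% masking rate, each task keeps roughly one coordinate in ten, so most coordinates of the merged model are touched by at most one task.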

If this is right

  • SoTU serves as a plug-and-play adapter for any pre-trained model on streaming data.
  • Optimal feature representations emerge without the need for complex classifier designs.
  • The method succeeds across multiple standard continual learning benchmarks.
  • Orthogonality in the delta parameters preserves prior task performance during fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same orthogonal fusion principle could be tested on prompt-based or other parameter-efficient methods.
  • Limits of the approach could be probed by scaling to models with billions of parameters or longer task sequences.
  • Orthogonality might interact with regularization techniques to further reduce interference.
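
A back-of-envelope calculation, offered in the same editorial spirit and not taken from the paper, suggests why random masking by itself pushes deltas toward orthogonality. It assumes independent Bernoulli masks $m_i, m_j$ with keep probability $q = 1 - p$, where $p$ is the masking rate, and it approximates the expectation of a ratio by the ratio of expectations.

```latex
\begin{align*}
\mathbb{E}\big[\langle m_i \odot \Delta\theta_i,\; m_j \odot \Delta\theta_j \rangle\big]
  &= q^{2}\,\langle \Delta\theta_i, \Delta\theta_j \rangle, \qquad i \neq j, \\
\mathbb{E}\big[\lVert m_i \odot \Delta\theta_i \rVert^{2}\big]
  &= q\,\lVert \Delta\theta_i \rVert^{2}, \\
\cos\big(\Delta\hat{\theta}_i, \Delta\hat{\theta}_j\big)
  &\approx \frac{q^{2}\,\langle \Delta\theta_i, \Delta\theta_j \rangle}
               {q\,\lVert \Delta\theta_i \rVert\,\lVert \Delta\theta_j \rVert}
  = (1 - p)\,\cos\big(\Delta\theta_i, \Delta\theta_j\big).
\end{align*}
```

Under these assumptions a 90% masking rate shrinks pairwise cosine similarity by roughly a factor of ten, and a constant rescaling of the surviving entries leaves it unchanged. The calculation says nothing about task accuracy, and it shows that sparsity and orthogonality are entangled under random masking, which is why separating the two would need a dedicated control.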

Load-bearing premise

The effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters.

What would settle it

An experiment that merges the same parameters without enforcing orthogonality and measures whether performance on earlier tasks drops sharply compared with the orthogonal case.

Figures

Figures reproduced from arXiv: 2411.02813 by Hai-Jian Ke, Jia-Yu Yao, Kun-Peng Ning, Li Yuan, Yong-Hong Tian, Yu-Yang Liu.

Figure 1: The sparse orthogonal characteristic of pre-trained parameters. We randomly sample delta parameters and merge them from multiple domains into one feature extractor with different masking rates. We show that merging high-sparsity deltas can maintain comparable or even superior performance on all three tasks, while low sparsity will cause serious parameter collisions, resulting in poor performance (right o…

Figure 2: The framework of SoTU. Facing the continual tasks, we first fine-tune the pre-trained ViT and obtain the task-specific …

Figure 3: The cosine similarity from $\Delta\hat{\theta}_1$ to $\Delta\hat{\theta}_{10}$ with 10% to 90% masking rates in CIFAR100 for 10 tasks, where the value in the i-th row and j-th column represents the similarity between $\Delta\hat{\theta}_i$ and $\Delta\hat{\theta}_j$. We demonstrate that high-sparsity deltas (randomly masking 90% delta parameters in Figure 3e) are more orthogonal to each other. Note that $W_k$ will not participate in the following processes, and its role is onl…

Figure 4: Continual learning performance comparison with PTM-based methods on three difficult datasets, …

Figure 5: The visualization of attention map in delta masking and merging processes. Randomly masking delta parameters can …

Figure 6: The characteristics of delta masking and merging.
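
The kind of pairwise similarity matrix described in the Figure 3 caption can be sketched on synthetic data as follows. This is an illustration only: the deltas are random vectors sharing a common component (an assumption chosen so that the dense deltas are clearly correlated), and the helper names `random_mask`, `cosine_matrix`, and `mean_abs_offdiag` are hypothetical, not from the paper.

```python
import numpy as np

def random_mask(delta, mask_rate, rng):
    """Zero out each coordinate independently with probability mask_rate."""
    return delta * (rng.random(delta.shape) >= mask_rate)

def cosine_matrix(vectors):
    """Entry (i, j) is the cosine similarity between vectors i and j."""
    V = np.stack(vectors)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    U = V / np.clip(norms, 1e-12, None)  # guard against all-zero rows
    return U @ U.T

def mean_abs_offdiag(S):
    """Average absolute similarity over all pairs i != j."""
    off = S[~np.eye(S.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off)))

rng = np.random.default_rng(0)
dim, n_tasks = 20_000, 10
shared = rng.normal(size=dim)
# ASSUMPTION: synthetic, correlated deltas standing in for real task deltas.
dense = [shared + rng.normal(size=dim) for _ in range(n_tasks)]

results = {}
for rate in (0.1, 0.5, 0.9):
    masked = [random_mask(d, rate, rng) for d in dense]
    results[rate] = mean_abs_offdiag(cosine_matrix(masked))
```

On this toy data the mean off-diagonal similarity falls as the masking rate rises, which mirrors the qualitative trend the caption reports; it does not reproduce the paper's CIFAR100 numbers.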
Original abstract

Continual learning methods based on pre-trained models (PTM) have recently gained attention which adapt to successive downstream tasks without catastrophic forgetting. These methods typically refrain from updating the pre-trained parameters and instead employ additional adapters, prompts, and classifiers. In this paper, we from a novel perspective investigate the benefit of sparse orthogonal parameters for continual learning. We found that merging sparse orthogonality of models learned from multiple streaming tasks has great potential in addressing catastrophic forgetting. Leveraging this insight, we propose a novel yet effective method called SoTU (Sparse Orthogonal Parameters TUning). We hypothesize that the effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters. Experimental evaluations on diverse CL benchmarks demonstrate the effectiveness of the proposed approach. Notably, SoTU achieves optimal feature representation for streaming data without necessitating complex classifier designs, making it a Plug-and-Play solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes SoTU (Sparse Orthogonal Parameters Tuning), a continual learning method for pre-trained models. Where existing PTM-based methods typically freeze the pre-trained parameters and add adapters, prompts, and classifiers, SoTU fine-tunes the pre-trained model on each task and works with the resulting delta parameters. It claims that merging sparse orthogonal delta parameters learned across streaming tasks mitigates catastrophic forgetting. The central hypothesis is that effectiveness arises from transforming multi-domain knowledge into fused orthogonal deltas. Experiments on diverse CL benchmarks are reported to show that SoTU yields optimal feature representations for streaming data without requiring complex classifier designs, positioning it as a plug-and-play solution.

Significance. If the orthogonality-based fusion mechanism can be isolated and shown to outperform sparsity or merging alone, the approach would offer a lightweight, parameter-efficient alternative to existing PTM-based CL methods. The plug-and-play aspect without complex classifiers could simplify deployment in streaming settings. However, the current presentation supplies no equations, derivations, or controlled ablations, so the significance cannot yet be assessed beyond the abstract-level assertion.

major comments (3)
  1. [Abstract] Abstract: The hypothesis that effectiveness 'lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters' is stated without any derivation, equation, or formal argument showing why orthogonality (versus sparsity alone or alternative merging operators) is required to prevent task interference.
  2. [Method] Method (inferred from abstract description): No ablation or controlled comparison is described that isolates the orthogonality constraint from the sparsity or merging mechanics; without this, performance gains cannot be attributed to the claimed orthogonal-fusion mechanism rather than other factors.
  3. [Experiments] Experiments (inferred from abstract): The claim of 'optimal feature representation ... without necessitating complex classifier designs' is unsupported by any reported baselines, metrics, error bars, or quantitative comparisons in the provided text, leaving the plug-and-play assertion uninspectable.
minor comments (1)
  1. [Abstract] Abstract: The sentence 'We from a novel perspective investigate' is grammatically incomplete and should be rephrased for readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will undertake to improve clarity and support for our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The hypothesis that effectiveness 'lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters' is stated without any derivation, equation, or formal argument showing why orthogonality (versus sparsity alone or alternative merging operators) is required to prevent task interference.

    Authors: The abstract provides a high-level summary of the hypothesis. The full manuscript describes the SoTU method, including how sparse orthogonal delta parameters are learned and merged across tasks. To strengthen the presentation, we will add a formal discussion or derivation in the revised version that explains why the orthogonality constraint helps mitigate task interference compared to sparsity or other merging strategies alone. revision: yes

  2. Referee: [Method] Method (inferred from abstract description): No ablation or controlled comparison is described that isolates the orthogonality constraint from the sparsity or merging mechanics; without this, performance gains cannot be attributed to the claimed orthogonal-fusion mechanism rather than other factors.

    Authors: We agree that isolating the orthogonality component is valuable for attributing performance gains. Our current experiments focus on overall effectiveness, but we will incorporate targeted ablation studies in the revision comparing SoTU against variants that remove the orthogonality constraint or employ alternative merging operators. revision: yes

  3. Referee: [Experiments] Experiments (inferred from abstract): The claim of 'optimal feature representation ... without necessitating complex classifier designs' is unsupported by any reported baselines, metrics, error bars, or quantitative comparisons in the provided text, leaving the plug-and-play assertion uninspectable.

    Authors: The provided text consists of the abstract, which summarizes the results. The full manuscript includes experimental evaluations on diverse CL benchmarks with quantitative comparisons. In the revision, we will make the supporting baselines, metrics, and results more explicit to substantiate the claims regarding optimal feature representations and the plug-and-play nature of the approach. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; claims rest on empirical hypothesis and experiments.

full rationale

The provided abstract and context contain no equations, derivations, or load-bearing mathematical steps. The central statement is explicitly labeled a hypothesis ('We hypothesize that the effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters') rather than a derived result. No self-citations, fitted parameters renamed as predictions, or ansatzes are visible. The method is presented as a plug-and-play empirical approach validated on benchmarks, with no reduction of outputs to inputs by construction. This is the common case of a method paper whose claims are not mathematically self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified or audited.

pith-pipeline@v0.9.0 · 5694 in / 944 out tokens · 35421 ms · 2026-05-23T18:03:37.272646+00:00 · methodology


