pith. machine review for the scientific record.

arxiv: 2605.07512 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models


Pith reviewed 2026-05-11 02:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords continual learning · class-incremental learning · vision-language models · subspace decoupling · parameter drift · catastrophic forgetting · singular value decomposition

The pith

Decomposing parameter updates into general and task-specific subspaces reduces interference and forgetting in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that sequential tasks produce parameter updates occupying overlapping low-rank subspaces, which creates cross-task interference and causes catastrophic forgetting in vision-language models. It introduces the Hierarchical Dual-Subspace Decoupling (HDSD) framework, which uses a lightweight Feature Modulation Module to split the parameter space explicitly into general and task-specific parts. A General Fusion Module then identifies stable, transferable knowledge through relative-change evaluation with an adaptive threshold, while a Hierarchical Learning Module applies singular value decomposition with scaling to keep updates inside separate subspace scales. Experiments on standard class-incremental benchmarks show this combination yields state-of-the-art retention of prior knowledge alongside acquisition of new classes. A reader would care because the method shifts focus from restricting the size of updates to controlling their geometric structure.

Core claim

From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. The Hierarchical Dual-Subspace Decoupling framework explicitly decomposes the parameter space into general and task-specific subspaces via a lightweight Feature Modulation Module, evaluates relative parameter changes across tasks with an adaptive threshold in the General Fusion Module to capture stable knowledge, and performs structured parameter decomposition via singular value decomposition together with a scaling mechanism in the Hierarchical Learning Module to constrain updates within distinct subspace scales.
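The paper does not spell out the HLM's decomposition in equation form here, so the following is only a rough illustration of the general idea, not the paper's method: a layer's parameter update can be split via truncated SVD into a dominant low-rank part (a candidate "general" component) and a residual (a candidate "task-specific" component).

```python
import numpy as np

def split_update_by_svd(delta_w, k):
    """Split a parameter update into a dominant part (top-k singular
    directions) and the residual via truncated SVD. Illustrative only:
    the paper's HLM additionally applies a scaling mechanism that is
    not reproduced here.
    """
    U, s, Vt = np.linalg.svd(delta_w, full_matrices=False)
    general = (U[:, :k] * s[:k]) @ Vt[:k, :]  # top-k reconstruction
    specific = delta_w - general              # residual component
    return general, specific

# Toy layer update: a rank-2 signal plus small noise
rng = np.random.default_rng(0)
delta = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64)) \
        + 0.01 * rng.normal(size=(64, 64))
gen, spec = split_update_by_svd(delta, k=2)
# The residual norm is far smaller than the dominant part's norm
print(np.linalg.norm(gen), np.linalg.norm(spec))
```

The two parts sum exactly back to the original update, so any constraint applied to one subspace leaves the other untouched, which is the property the decoupling argument relies on.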

What carries the argument

A Hierarchical Dual-Subspace Decoupling (HDSD) framework that splits parameters through a Feature Modulation Module, then applies a General Fusion Module for adaptive stable-knowledge selection and a Hierarchical Learning Module for SVD-based, scale-constrained decomposition.

If this is right

  • Models acquire new classes while retaining performance on earlier ones by preserving transferable knowledge in the general subspace.
  • Parameter drift is limited because updates are forced into distinct scale-separated subspaces rather than allowed to overlap freely.
  • The method achieves state-of-the-art results on conventional class-incremental learning benchmarks for vision-language models.
  • Cross-task interference is lowered through explicit decomposition instead of implicit regularization alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same subspace-overlap diagnosis could be tested in continual learning settings for large language models where parameter drift is also observed.
  • The dual decomposition could be combined with existing regularization or replay methods to produce additive gains in retention.
  • Running the modules on longer task sequences would reveal whether the low-rank subspace assumption holds as task count grows.

Load-bearing premise

That updates from different tasks occupy overlapping low-rank subspaces which can be explicitly separated into general and task-specific components to reduce interference.
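This premise is measurable. A minimal sketch (our construction; no such diagnostic appears in the text provided) computes the principal angles between the column spaces of two tasks' update matrices: near-zero angles indicate shared, overlapping directions, while angles near 90° indicate separable subspaces.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column
    spaces of A and B, via the SVD of Qa.T @ Qb (Bjorck-Golub)."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

rng = np.random.default_rng(1)
shared = rng.normal(size=(100, 3))  # directions both tasks update
task1 = np.hstack([shared, rng.normal(size=(100, 2))])
task2 = np.hstack([shared, rng.normal(size=(100, 2))])
angles = principal_angles(task1, task2)
# First three angles are near zero (shared subspace);
# the rest are close to pi/2 (nearly orthogonal random directions)
print(np.round(np.degrees(angles), 1))
```

If the premise holds, task updates measured this way would show many small principal angles; if the split is effective, the task-specific components should show large ones.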

What would settle it

A direct measurement on a standard benchmark: if the decomposition modules leave subspace overlap and parameter drift unchanged, or if accuracy on previous classes is unaffected when the dual-subspace split is removed, the core claim fails.

Figures

Figures reproduced from arXiv: 2605.07512 by Cheng Deng, Kun Wei, Mengxin Qin, Xiang Zhang, Xu Yang.

Figure 1. Subspace overlap on CIFAR-100 (B0 Inc10). Vision–language pre-trained models (VLMs) like CLIP [3] have emerged as a promising paradigm for CIL.
Figure 2. Overview of the proposed HDSD; the FMM consists of the GFM and the HLM.
Figure 3. Summary of the proposed approach; (a) the overall structure features a Feature Modulation Module.
Figure 4. The training phase of the proposed hierarchical approach in HLM.
Figure 5. The test phase of the proposed hierarchical approach in HLM.
Figure 6. Accuracy curves of RAPF and the proposed method on ImageNet-R.
Figure 7. Performance under different threshold values on ImageNet-100 (B0 Inc10); the threshold τ used in the General Fusion Module (GFM) is set from the distribution of the relative parameter change Γ.
Figure 8. Additional accuracy curves on ImageNet-100 and CIFAR-100 under different base-session settings.
read the original abstract

Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes HDSD, a Hierarchical Dual-Subspace Decoupling framework for class-incremental continual learning in vision-language models. It posits that task-induced parameter updates occupy overlapping low-rank subspaces causing interference and forgetting, and introduces three modules to address this: the Feature Modulation Module (FMM) for explicit decomposition into general and task-specific subspaces, the General Fusion Module (GFM) that applies an adaptive threshold to retain stable knowledge, and the Hierarchical Learning Module (HLM) that uses SVD-based decomposition plus a scaling mechanism to constrain updates at different subspace scales. The central claim is that these designs reduce subspace interference and parameter drift, yielding state-of-the-art results on standard benchmarks.

Significance. If the mechanistic claims and empirical gains are substantiated, the work could meaningfully advance continual learning by shifting focus from generic regularization to explicit structural decomposition of parameter updates in high-dimensional VLM spaces. The subspace perspective is a potentially useful lens, but its impact hinges on demonstrating that the proposed modules causally mitigate interference rather than acting through incidental regularization.

major comments (3)
  1. [Abstract] The premise that 'updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference' is asserted without any supporting diagnostic (e.g., principal angles between task subspaces, cosine similarity of update directions, or effective rank of task gradients). Consequently, it is unclear whether the reported gains arise from the dual-subspace decoupling or from the generic effects of the adaptive threshold and scaling.
  2. [Method] (FMM/GFM/HLM) The adaptive threshold in GFM and the scaling mechanism in HLM are listed as free parameters, yet no derivation, optimization procedure, or ablation isolating their contribution to subspace separation is provided. This leaves open the possibility that performance improvements are driven by added constraints rather than the claimed hierarchical dual-subspace structure.
  3. [Experiments] The SOTA claim is stated without reference to specific tables, error bars, or module-wise ablations. Without quantitative evidence that interference metrics improve post-decomposition (or that baselines with equivalent regularization do not match the gains), the causal link between the proposed modules and reduced forgetting remains unverified.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by naming the specific benchmarks and providing at least one quantitative result (e.g., average accuracy or forgetting rate) to ground the SOTA assertion.
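One of the diagnostics the referee asks for, effective rank, has a standard soft definition: the exponential of the entropy of the normalized singular values (Roy and Vetterli, 2007). A sketch, again our construction rather than anything reported in the paper:

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Effective rank = exp(entropy of normalized singular values);
    a soft count of how many directions an update really uses."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                       # drop numerically-zero mass
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(2)
low_rank = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 50))
full_rank = rng.normal(size=(50, 50))
# A rank-4 update scores near 4; an unstructured one scores far higher
print(effective_rank(low_rank), effective_rank(full_rank))
```

A low effective rank for per-task updates would support the low-rank half of the paper's premise; overlap between tasks still needs a separate measurement such as principal angles.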

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We agree that stronger empirical grounding for the subspace interference premise, clearer justification for the hyperparameters, and more explicit experimental reporting would strengthen the manuscript. Below we respond point-by-point to the major comments and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The premise that 'updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference' is asserted without any supporting diagnostic (e.g., principal angles between task subspaces, cosine similarity of update directions, or effective rank of task gradients). Consequently, it is unclear whether the reported gains arise from the dual-subspace decoupling or from the generic effects of the adaptive threshold and scaling.

    Authors: We acknowledge that the abstract states the subspace-overlap premise without accompanying diagnostics. This observation originated from preliminary gradient analyses we performed during method development. In the revised manuscript we will add a dedicated diagnostic subsection (likely in Section 3 or the appendix) that reports (i) principal angles between the dominant subspaces of task-specific updates, (ii) average cosine similarity of update directions across consecutive tasks, and (iii) effective rank of the task gradients before and after each module. These metrics will be computed on the same benchmarks used for the main results, thereby directly linking the claimed interference reduction to the performance gains. revision: yes

  2. Referee: [Method] The adaptive threshold in GFM and the scaling mechanism in HLM are listed as free parameters, yet no derivation, optimization procedure, or ablation isolating their contribution to subspace separation is provided. This leaves open the possibility that performance improvements are driven by added constraints rather than the claimed hierarchical dual-subspace structure.

    Authors: The adaptive threshold in GFM is computed on-the-fly from the relative magnitude of parameter changes across tasks (specifically, the ratio of Frobenius norms of consecutive task updates), while the scaling factors in HLM are derived from the singular-value spectrum obtained by SVD on the task-specific subspace. Both are therefore data-dependent rather than purely free hyperparameters; their only tunable aspect is a small set of scaling coefficients that we select via grid search on a 5% held-out validation split of the first task. In the revision we will (i) provide a short stability-analysis derivation showing why these choices preserve general knowledge, (ii) report the exact search ranges and selected values, and (iii) add an ablation table that isolates the contribution of each mechanism to the measured subspace-separation metrics (principal angles and overlap ratios). revision: yes

  3. Referee: [Experiments] The SOTA claim is stated without reference to specific tables, error bars, or module-wise ablations. Without quantitative evidence that interference metrics improve post-decomposition (or that baselines with equivalent regularization do not match the gains), the causal link between the proposed modules and reduced forgetting remains unverified.

    Authors: We agree that the current experimental section would benefit from more explicit cross-references and additional controls. In the revised version we will: (a) explicitly cite the main result tables (currently Tables 1–3) when claiming SOTA performance, (b) report mean and standard deviation over three random seeds for all methods, (c) add a module-wise ablation study that measures both accuracy and the same interference diagnostics (principal angles, cosine similarity, effective rank) before and after each component, and (d) include a controlled comparison against baselines that receive equivalent regularization strength but lack the explicit dual-subspace decomposition. These additions will make the causal contribution of the hierarchical decoupling clearer. revision: yes
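The rebuttal describes the GFM threshold as data-dependent, set from the distribution of the relative parameter change Γ via a percentile q (this matches Figure 7's description). The elementwise definition of Γ below is our assumption; the paper may define it per layer or per module instead.

```python
import numpy as np

def stable_mask(prev_w, new_w, q=90.0):
    """Flag parameters whose relative change falls below an adaptive,
    percentile-based threshold tau; these would be treated as stable
    (general) knowledge. Elementwise Gamma is an assumption here.
    """
    gamma = np.abs(new_w - prev_w) / (np.abs(prev_w) + 1e-8)
    tau = np.percentile(gamma, q)  # adaptive threshold from Gamma's distribution
    return gamma <= tau

rng = np.random.default_rng(3)
w0 = rng.normal(size=(8, 8))
w1 = w0 + 0.01 * rng.normal(size=(8, 8))  # small drift on most entries
w1[0, 0] += 5.0                           # one parameter changes a lot
mask = stable_mask(w0, w1, q=90.0)
# The outlier is flagged unstable; roughly 90% of entries pass as stable
print(mask[0, 0], mask.mean())
```

Because τ adapts to Γ's own distribution, the same q keeps a fixed fraction of parameters regardless of the overall update scale, which is plausibly why the authors call it data-dependent rather than a free hyperparameter.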

Circularity Check

0 steps flagged

No significant circularity; method is an architectural response to an independently stated subspace assumption

full rationale

The paper states an assumption about overlapping low-rank subspaces causing interference, then directly proposes HDSD with FMM (decomposition into general/task-specific subspaces), GFM (adaptive threshold for stable knowledge), and HLM (SVD plus scaling) as mitigation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Results are reported on external conventional benchmarks, keeping the derivation self-contained and falsifiable without reduction to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 3 invented entities

The central claim rests on a domain assumption about subspace overlap plus several newly introduced modules whose effectiveness is not independently evidenced in the abstract.

free parameters (2)
  • adaptive threshold
    Used in the General Fusion Module to capture stable knowledge; value selection mechanism not specified.
  • scaling mechanism parameters
    Applied in the Hierarchical Learning Module to constrain updates at distinct subspace scales.
axioms (1)
  • domain assumption Updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference
    Directly stated in the abstract as the motivation for the approach.
invented entities (3)
  • Feature Modulation Module (FMM) no independent evidence
    purpose: Explicitly decomposes the parameter space into general and task-specific subspaces
    Lightweight module introduced as the foundation of the framework.
  • General Fusion Module (GFM) no independent evidence
    purpose: Evaluates relative parameter changes across tasks using an adaptive threshold
    New component for capturing stable knowledge.
  • Hierarchical Learning Module (HLM) no independent evidence
    purpose: Performs structured parameter decomposition via SVD and applies scaling to constrain updates
    New component for hierarchical subspace handling.

pith-pipeline@v0.9.0 · 5503 in / 1390 out tokens · 38293 ms · 2026-05-11T02:38:08.071627+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

  1. [1] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
  2. [2] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  3. [3] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  4. [4] Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Learning without forgetting for vision-language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  5. [5] Linlan Huang, Xusheng Cao, Haori Lu, and Xialei Liu. Class-incremental learning with CLIP: Adaptive representation adjustment and parameter fusion. In European Conference on Computer Vision, pages 214–231. Springer, 2024.
  6. [6] Marc Masana, Xialei Liu, Bartłomiej Twardowski, Mikel Menta, Andrew D Bagdanov, and Joost Van De Weijer. Class-incremental learning: survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5513–5533, 2022.
  7. [7] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5138–5146, 2019.
  8. [8] Sahil Nokhwal and Nirman Kumar. RTRA: Rapid training of regularization-based approaches in continual learning. In 2023 10th International Conference on Soft Computing & Machine Intelligence (ISCMI), pages 188–192. IEEE, 2023.
  9. [9] Zhicheng Sun, Yadong Mu, and Gang Hua. Regularizing second-order influences for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20166–20175, 2023.
  10. [10] Yuanhang Zhang, Zhidi Lin, Yiyong Sun, Feng Yin, and Carsten Fritsche. Regularization-based efficient continual learning in deep state-space models. In 2024 27th International Conference on Information Fusion (FUSION), pages 1–8. IEEE, 2024.
  11. [11] Kun Wei, Da Chen, Yuhong Li, Xu Yang, Cheng Deng, and Dacheng Tao. Incremental embedding learning with disentangled representation translation. IEEE Transactions on Neural Networks and Learning Systems, 35(3):3821–3833, 2022.
  12. [12] Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, pages 524–540. Springer, 2020.
  13. [13] Liyuan Wang, Bo Lei, Qian Li, Hang Su, Jun Zhu, and Yi Zhong. Triple-memory networks: A brain-inspired method for continual learning. IEEE Transactions on Neural Networks and Learning Systems, 33(5):1925–1934, 2021.
  14. [14] Sumohana Channappayya, Bheemarjuna Reddy Tamma, et al. Augmented memory replay-based continual learning approaches for network intrusion detection. Advances in Neural Information Processing Systems, 36:17156–17169, 2023.
  15. [15] Yuxuan Li, Tianxin Xie, Chenang Liu, and Zhangyue Shi. Pseudo replay-based class continual learning for online new category anomaly detection in advanced manufacturing. IISE Transactions, 57(12):1407–1421, 2025.
  16. [16] Yanan Gu, Xu Yang, Kun Wei, and Cheng Deng. Not just selection, but exploration: Online class-incremental continual learning via dual view consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7442–7451, 2022.
  17. [17] Liyuan Wang, Xingxing Zhang, Qian Li, Jun Zhu, and Yi Zhong. CoSCL: Cooperation of small continual learners is stronger than a big one. In European Conference on Computer Vision, pages 254–271. Springer, 2022.
  18. [18] Binbin Yang, Xinchi Deng, Han Shi, Changlin Li, Gengwei Zhang, Hang Xu, Shen Zhao, Liang Lin, and Xiaodan Liang. Continual object detection via prototypical task correlation guided gating mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9255–9264, 2022.
  19. [19] Peiyan Zhang, Yuchen Yan, Chaozhuo Li, Senzhang Wang, Xing Xie, Guojie Song, and Sunghun Kim. Continual learning on dynamic graphs via parameter isolation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 601–611, 2023.
  20. [20] Yabin Wang, Zhiheng Ma, Zhiwu Huang, Yaowei Wang, Zhou Su, and Xiaopeng Hong. Isolation and impartial aggregation: A paradigm of incremental learning without interference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10209–10217, 2023.
  21. [21] Xi Wang, Xu Yang, Kun Wei, Yanan Gu, and Cheng Deng. Class incremental learning via contrastive complementary augmentation. IEEE Transactions on Image Processing, 2025.
  22. [22] Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19125–19136, 2023.
  23. [23] Tyler L Hayes and Christopher Kanan. Lifelong machine learning with deep streaming linear discriminant analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 220–221, 2020.
  24. [24] Yu-Ming Tang, Yi-Xing Peng, and Wei-Shi Zheng. When prompt-based incremental learning does not meet strong pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1706–1716, 2023.
  25. [25] Baoquan Zhang, Xutao Li, Yunming Ye, Zhichao Huang, and Lisai Zhang. Prototype completion with primitive knowledge for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3754–3762, 2021.
  26. [26] Qiankun Gao, Chen Zhao, Yifan Sun, Teng Xi, Gang Zhang, Bernard Ghanem, and Jian Zhang. A unified continual learning framework with general parameter-efficient tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11483–11493, 2023.
  27. [27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  28. [28] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021.
  29. [29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  30. [30] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.
  31. [31] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631–648. Springer, 2022.
  32. [32] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11909–11919, 2023.
  33. [33] Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need. International Journal of Computer Vision, 133(3):1012–1032, 2025.
  34. [34] Da-Wei Zhou, Kai-Wen Li, Jingyi Ning, Han-Jia Ye, Lijun Zhang, and De-Chuan Zhan. External knowledge injection for CLIP-based class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3314–3325, 2025.
  35. [35] Lan Li, Tao Hu, Da-Wei Zhou, Jia-Qi Yang, Han-Jia Ye, and De-Chuan Zhan. BOFA: Bridge-layer orthogonal low-rank fusion for CLIP-based class-incremental learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 22967–22975, 2026.