Recognition: unknown
GR4CIL: Gap-compensated Routing for CLIP-based Class Incremental Learning
Pith reviewed 2026-05-10 05:30 UTC · model grok-4.3
The pith
GR4CIL adds orthogonal compensation to CLIP models so task routing stays accurate as new classes arrive without losing zero-shot ability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GR4CIL preserves task-specific visual knowledge while maintaining an incrementally stable shared textual semantic space, and introduces an orthogonal compensation mechanism to mitigate modality-gap-induced bias, enhance within-task discrimination, and enlarge the score margin between the ground-truth task and competing tasks, thereby enabling more reliable task-aware routing over learned knowledge while retaining the zero-shot generalization capability.
What carries the argument
The orthogonal compensation mechanism that adjusts features to reduce modality gap bias and widen score margins between the true task and others.
Load-bearing premise
The orthogonal compensation successfully reduces modality gap bias and widens score margins between tasks without destabilizing the shared textual space or creating new interference.
What would settle it
An experiment in which the compensation step fails to increase the ground-truth task score margin over strong baselines, or in which cross-task routing accuracy does not improve as tasks accumulate, would falsify the central claim.
Figures
read the original abstract
Class-Incremental Learning (CIL) aims to continuously acquire new categories while preserving previously learned knowledge. Recently, Contrastive Language-Image Pre-trained (CLIP) models have shown strong potential for CIL due to their powerful generalization ability. However, existing methods still face two key challenges: shared-parameter adaptation tends to cause old-knowledge drift, and task-specific knowledge organization often leads to poorly calibrated cross-task responses, making reliable routing difficult. To address these issues, we propose GR4CIL, a framework combining task discrimination and knowledge routing for CLIP-based CIL. GR4CIL preserves task-specific visual knowledge while maintaining an incrementally stable shared textual semantic space, thereby reducing interference across tasks. Moreover, we introduce an orthogonal compensation mechanism to mitigate modality-gap-induced bias, enhance within-task discrimination, and enlarge the score margin between the ground-truth task and competing tasks. As a result, GR4CIL enables more reliable task-aware routing over learned knowledge while retaining the zero-shot generalization capability. Experiments on multiple benchmarks show that GR4CIL consistently outperforms strong baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GR4CIL, a framework for CLIP-based class-incremental learning that integrates task discrimination with knowledge routing. It preserves task-specific visual adapters while maintaining an incrementally stable shared textual semantic space, and introduces an orthogonal compensation mechanism to reduce modality-gap bias, improve within-task discrimination, and enlarge score margins between the ground-truth task and competitors. This is claimed to enable reliable task-aware routing without sacrificing CLIP's zero-shot generalization. Experiments on standard CIL benchmarks are reported to show consistent gains over strong baselines.
Significance. If the central claims hold, the work would advance CLIP-based continual learning by offering a concrete way to decouple visual adaptation from textual stability and to compensate for modality gaps at the routing stage. The retention of zero-shot capability alongside incremental gains is a notable strength, as is the empirical validation across multiple benchmarks. The approach could influence future designs that seek to keep foundation-model semantic spaces intact during incremental updates.
major comments (2)
- [§3.3] §3.3 (Orthogonal Compensation): The manuscript asserts that the mechanism mitigates modality-gap bias and enlarges margins 'without destabilizing the shared textual semantic space or creating new inter-task interference,' yet supplies no explicit projection operator, orthogonality constraint, or regularization term (e.g., no loss of the form ||P_t^T P_t - I|| or subspace projection onto fixed text embeddings). Because this mechanism is load-bearing for both the routing reliability claim and the zero-shot retention claim, its absence of formal definition and supporting ablations constitutes a major gap.
- [§4.2] §4.2 (Ablation Studies): The reported gains on task-aware routing are attributed to the combination of stable textual space and orthogonal compensation, but the ablation table does not isolate the effect of removing the orthogonality constraint while keeping the compensation magnitude fixed. Without this control, it is impossible to verify that the observed margin enlargement is due to orthogonality rather than simple scaling or post-hoc calibration.
minor comments (3)
- [§3.1] Notation for the task-specific visual adapters and the shared text encoder is introduced without a clear table of symbols; readers must infer the distinction between V_t and the frozen text encoder T from context.
- [Figure 2] Figure 2 (framework overview) labels the compensation block but does not indicate whether the compensation is applied only at inference or also during adapter training; a small annotation would remove ambiguity.
- [§2] The related-work section cites several recent CLIP-CIL papers but omits discussion of orthogonal-projection techniques from the broader continual-learning literature (e.g., orthogonal gradient descent methods); a brief comparison would strengthen positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the presentation of our orthogonal compensation mechanism and the supporting experiments. We address each major comment below and commit to revisions that strengthen the formalization and empirical validation.
read point-by-point responses
-
Referee: [§3.3] §3.3 (Orthogonal Compensation): The manuscript asserts that the mechanism mitigates modality-gap bias and enlarges margins 'without destabilizing the shared textual semantic space or creating new inter-task interference,' yet supplies no explicit projection operator, orthogonality constraint, or regularization term (e.g., no loss of the form ||P_t^T P_t - I|| or subspace projection onto fixed text embeddings). Because this mechanism is load-bearing for both the routing reliability claim and the zero-shot retention claim, its absence of formal definition and supporting ablations constitutes a major gap.
Authors: We acknowledge that the current description in §3.3 presents the orthogonal compensation primarily at a conceptual level without an explicit operator or constraint equation. The mechanism projects visual adapter outputs onto the orthogonal complement of the estimated modality-gap direction derived from the fixed text embeddings, which is intended to avoid interference with the shared textual space. To address this gap, we will revise §3.3 to include the precise projection formula, the orthogonality condition, and a brief derivation showing why it preserves textual stability. We will also add targeted ablations quantifying the effect on zero-shot accuracy and inter-task score margins. These changes will make the load-bearing claims fully supported. revision: yes
-
Referee: [§4.2] §4.2 (Ablation Studies): The reported gains on task-aware routing are attributed to the combination of stable textual space and orthogonal compensation, but the ablation table does not isolate the effect of removing the orthogonality constraint while keeping the compensation magnitude fixed. Without this control, it is impossible to verify that the observed margin enlargement is due to orthogonality rather than simple scaling or post-hoc calibration.
Authors: We agree that the existing ablation table in §4.2 does not contain the requested control experiment. The current variants remove compensation entirely or disable task discrimination, but do not apply a non-orthogonal compensation of identical magnitude. In the revision we will insert an additional row (or sub-table) that applies compensation without the orthogonality constraint at fixed scale and reports the resulting changes in within-task discrimination, cross-task margins, and routing accuracy. This will isolate the contribution of orthogonality from mere scaling effects. revision: yes
Circularity Check
No significant circularity; empirical proposal with independent validation
full rationale
The paper describes GR4CIL as a framework that combines task discrimination, knowledge routing, and an orthogonal compensation mechanism for CLIP-based class-incremental learning. The abstract and provided description contain no equations, derivations, parameter fits, or self-citations that reduce any claimed result to its inputs by construction. Benefits such as stable textual space, enlarged score margins, and reliable routing are presented as outcomes of the proposed architecture rather than tautological restatements. Experiments on benchmarks are invoked as external validation, leaving the method self-contained without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Towards continual learning desiderata via hsic-bottleneck orthogonalization and equiangular embedding,
D. Li, T. Wang, J. Chen, Q. Ren, K. Kawaguchi, and Z. Zeng, “Towards continual learning desiderata via hsic-bottleneck orthogonalization and equiangular embedding,” inProceedings of the AAAI Conference on Artificial Intelligence, pp. 13464–13473, 2024
2024
-
[2]
Harnessing neural unit dynamics for effective and scalable class-incremental learning,
D. Li, T. Wang, J. Chen, W. Dai, and Z. Zeng, “Harnessing neural unit dynamics for effective and scalable class-incremental learning,” inInternational Conference on Machine Learning, pp. 28688–28705, 2024
2024
-
[3]
Learning transferable visual models from natural language supervi- sion,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervi- sion,” inInternational conference on machine learning, pp. 8748–8763, PmLR, 2021
2021
-
[4]
Mind the gap: Preserving and compensating for the modality gap in clip-based continual learning,
L. Huang, X. Cao, H. Lu, Y . Meng, F. Yang, and X. Liu, “Mind the gap: Preserving and compensating for the modality gap in clip-based continual learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3777–3786, 2025
2025
-
[5]
Boosting continual learning of vision-language models via mixture-of-experts adapters,
J. Yu, Y . Zhuge, L. Zhang, P. Hu, D. Wang, H. Lu, and Y . He, “Boosting continual learning of vision-language models via mixture-of-experts adapters,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23219–23230, 2024
2024
-
[6]
External knowledge injec- tion for clip-based class-incremental learning,
D.-W. Zhou, K.-W. Li, J. Ning, H.-J. Ye, L. Zhang, and D.-C. Zhan, “External knowledge injec- tion for clip-based class-incremental learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3314–3325, 2025
2025
-
[7]
LADA: Scalable label-specific CLIP adapter for continual learning,
M.-L. Luo, Z.-H. Zhou, T. Wei, and M.-L. Zhang, “LADA: Scalable label-specific CLIP adapter for continual learning,” inForty-second International Conference on Machine Learning, 2025
2025
-
[8]
Class-incremental learning with clip: Adaptive rep- resentation adjustment and parameter fusion,
L. Huang, X. Cao, H. Lu, and X. Liu, “Class-incremental learning with clip: Adaptive rep- resentation adjustment and parameter fusion,” inEuropean Conference on Computer Vision, pp. 214–231, Springer, 2024
2024
-
[9]
Clap4clip: Continual learning with probabilistic finetuning for vision-language models,
S. Jha, D. Gong, and L. Yao, “Clap4clip: Continual learning with probabilistic finetuning for vision-language models,”Advances in neural information processing systems, vol. 37, pp. 129146–129186, 2024
2024
-
[10]
Catastrophic forgetting in connectionist networks,
R. M. French, “Catastrophic forgetting in connectionist networks,”Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999
1999
-
[11]
C-clip: Multimodal continual learning for vision-language model,
W. Liu, F. Zhu, L. Wei, and Q. Tian, “C-clip: Multimodal continual learning for vision-language model,” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[12]
Clip-adapter: Better vision-language models with feature adapters,
P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y . Zhang, H. Li, and Y . Qiao, “Clip-adapter: Better vision-language models with feature adapters,”International journal of computer vision, vol. 132, no. 2, pp. 581–595, 2024
2024
-
[13]
Learning to prompt for vision-language models,
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International journal of computer vision, vol. 130, no. 9, pp. 2337–2348, 2022
2022
-
[14]
SD- loRA: Scalable decoupled low-rank adaptation for class incremental learning,
Y . Wu, H. Piao, L.-K. Huang, R. Wang, W. Li, H. Pfister, D. Meng, K. Ma, and Y . Wei, “SD- loRA: Scalable decoupled low-rank adaptation for class incremental learning,” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[15]
On the discrimination and consistency for exemplar-free class incremental learning,
T. Wang, J. Guo, D. Li, and Z. Chen, “On the discrimination and consistency for exemplar-free class incremental learning,” inProceedings of the Thirty-Fourth International Joint Confer- ence on Artificial Intelligence, IJCAI-25(J. Kwok, ed.), pp. 6424–6432, International Joint Conferences on Artificial Intelligence Organization, 8 2025. Main Track
2025
-
[16]
Contin- ual learning of image classes with language guidance from a vision-language model,
W. Zhang, Y . Huang, W. Zhang, T. Zhang, Q. Lao, Y . Yu, W.-S. Zheng, and R. Wang, “Contin- ual learning of image classes with language guidance from a vision-language model,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 12, pp. 13152–13163, 2024. 10
2024
-
[17]
Visual class incremental learning with textual priors guidance based on an adapted vision-language model,
W. Zhang, T. Yu, R. Wang, J. Xie, E. Trucco, W.-S. Zheng, and X. Yang, “Visual class incremental learning with textual priors guidance based on an adapted vision-language model,” IEEE Transactions on Multimedia, 2025
2025
-
[18]
Semantic-guided LoRA Parameters Generation,
M. Li, Y . Chen, Z. Rao, C. Jiang, and J. Guo, “Semantic-guided LoRA Parameters Generation,” arXiv e-prints, p. arXiv:2509.10535, Sept. 2025
-
[19]
A theoretical study on solving continual learning,
G. Kim, C. Xiao, T. Konishi, Z. Ke, and B. Liu, “A theoretical study on solving continual learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 5065–5079, 2022
2022
-
[20]
Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,
V . W. Liang, Y . Zhang, Y . Kwon, S. Yeung, and J. Y . Zou, “Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 17612–17625, 2022
2022
-
[21]
Slca: Slow learner with classifier alignment for continual learning on a pre-trained model,
G. Zhang, L. Wang, G. Kang, L. Chen, and Y . Wei, “Slca: Slow learner with classifier alignment for continual learning on a pre-trained model,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19148–19158, 2023
2023
-
[22]
Dualprompt: Complementary prompting for rehearsal-free continual learning,
Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y . Lee, X. Ren, G. Su, V . Perot, J. Dy,et al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” inComputer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, pp. 631–648, Springer, 2022
2022
-
[23]
Coda-prompt: Continual decomposed attention-based prompting for rehearsal- free continual learning,
J. S. Smith, L. Karlinsky, V . Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira, “Coda-prompt: Continual decomposed attention-based prompting for rehearsal- free continual learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11909–11919, 2023
2023
-
[24]
Learning to prompt for continual learning,
Z. Wang, Z. Zhang, C.-Y . Lee, H. Zhang, R. Sun, X. Ren, G. Su, V . Perot, J. Dy, and T. Pfister, “Learning to prompt for continual learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 139–149, 2022
2022
-
[25]
Preventing zero-shot transfer degradation in continual learning of vision-language models,
Z. Zheng, M. Ma, K. Wang, Z. Qin, X. Yue, and Y . You, “Preventing zero-shot transfer degradation in continual learning of vision-language models,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 19125–19136, 2023
2023
-
[26]
Learning without forgetting for vision-language models,
D.-W. Zhou, Y . Zhang, Y . Wang, J. Ning, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Learning without forgetting for vision-language models,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[27]
Magmax: Leveraging model merging for seamless continual learning,
D. Marczak, B. Twardowski, T. Trzci´nski, and S. Cygert, “Magmax: Leveraging model merging for seamless continual learning,” inEuropean Conference on Computer Vision, pp. 379–395, Springer, 2024
2024
-
[28]
Provable guarantees for understanding out-of-distribution detection,
P. Morteza and Y . Li, “Provable guarantees for understanding out-of-distribution detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7831–7840, 2022
2022
-
[29]
How to exploit hyperspherical embeddings for out-of- distribution detection?,
Y . Ming, Y . Sun, O. Dia, and Y . Li, “How to exploit hyperspherical embeddings for out-of- distribution detection?,” inInternational Conference on Learning Representations, 2023
2023
-
[30]
Learning with mixture of prototypes for out-of-distribution detection,
H. Lu, D. Gong, S. Wang, J. Xue, L. Yao, and K. Moore, “Learning with mixture of prototypes for out-of-distribution detection,” inInternational Conference on Learning Representations, 2024
2024
-
[31]
Continual learning based on ood detection and task masking,
G. Kim, S. Esmaeilpour, C. Xiao, and B. Liu, “Continual learning based on ood detection and task masking,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 3856–3866, 2022
2022
-
[32]
A multi-head model for continual learning via out-of-distribution replay,
G. Kim, B. Liu, and Z. Ke, “A multi-head model for continual learning via out-of-distribution replay,” inConference on Lifelong Learning Agents, pp. 548–563, PMLR, 2022
2022
-
[33]
Class incremental learning via likelihood ratio based task prediction,
H. Lin, Y . Shao, W. Qian, N. Pan, Y . Guo, and B. Liu, “Class incremental learning via likelihood ratio based task prediction,” inInternational Conference on Learning Representations, 2024. 11
2024
-
[34]
Mitigate the gap: Improving cross-modal alignment in clip,
S. Eslami and G. de Melo, “Mitigate the gap: Improving cross-modal alignment in clip,” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[35]
Cross the gap: Exposing the intra-modal misalignment in clip via modality inversion,
M. Mistretta, A. Baldrati, L. Agnolucci, M. Bertini, and A. D. Bagdanov, “Cross the gap: Exposing the intra-modal misalignment in clip via modality inversion,” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[36]
Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models,
S. Schrodi, D. T. Hoffmann, M. Argus, V . Fischer, and T. Brox, “Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models,” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[37]
On the value of cross-modal misalignment in multimodal representation learning,
Y . Cai, Y . Liu, E. Gao, T. Jiang, Z. Zhang, A. van den Hengel, and J. Q. Shi, “On the value of cross-modal misalignment in multimodal representation learning,” inAdvances in Neural Information Processing Systems (NeurIPS), 2025
2025
-
[38]
Post-pre-training for modality alignment in vision-language foundation models,
S. Yamaguchi, D. Feng, S. Kanai, K. Adachi, and D. Chijiwa, “Post-pre-training for modality alignment in vision-language foundation models,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 4256–4266, 2025
2025
-
[39]
Imagenet: A large-scale hierarchical image database,
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009
2009
-
[40]
Lora: Low-rank adaptation of large language models.,
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen,et al., “Lora: Low-rank adaptation of large language models.,”Iclr, vol. 1, no. 2, p. 3, 2022
2022
-
[41]
Learning multiple layers of features from tiny images,
A. Krizhevsky, G. Hinton,et al., “Learning multiple layers of features from tiny images,” Handbook of Systemic Autoimmune Diseases, 2009
2009
-
[42]
The many faces of robustness: A critical analysis of out-of-distribution generalization,
D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo,et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349, 2021
2021
-
[43]
A model or 603 exemplars: Towards memory-efficient class-incremental learning,
D.-W. Zhou, Q.-W. Wang, H.-J. Ye, and D.-C. Zhan, “A model or 603 exemplars: Towards memory-efficient class-incremental learning,” inInternational Conference on Learning Repre- sentations, 2023
2023
-
[44]
Clip model is an efficient continual learner,
V . Thengane, S. Khan, M. Hayat, and F. Khan, “Clip model is an efficient continual learner,” arXiv preprint arXiv:2210.03114, 2022
-
[45]
Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need,
D.-W. Zhou, Z.-W. Cai, H.-J. Ye, D.-C. Zhan, and Z. Liu, “Revisiting class-incremental learning with pre-trained models: Generalizability and adaptivity are all you need,”International Journal of Computer Vision, vol. 133, no. 3, pp. 1012–1032, 2025
2025
-
[46]
A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,”arXiv preprint arXiv:1610.02136, 2016
work page internal anchor Pith review arXiv 2016
-
[47]
Cats and dogs,
O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in2012 IEEE conference on computer vision and pattern recognition, pp. 3498–3505, IEEE, 2012
2012
-
[48]
Food-101–mining discriminative components with random forests,
L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” inEuropean conference on computer vision, pp. 446–461, Springer, 2014. 12 A Theoretical Proofs and Clarification Feasible set in the text subspace.For task t, let the text feature matrix be Tt =U tΣtV⊤ t , and let Pt =U tU⊤ t be the orthogonal proj...
2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.