Fine-Tuning Regimes Define Distinct Continual Learning Problems
Pith reviewed 2026-05-09 22:59 UTC · model grok-4.3
The pith
The relative ranking of continual learning methods is not preserved when the trainable depth of the fine-tuning regime changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When adaptation regimes are formalized as projected optimization over fixed trainable subspaces, changing the trainable depth shifts the relative ranking of methods: deeper regimes produce larger update magnitudes, higher forgetting rates, and a tighter link between the two.
What carries the argument
Projected optimization over fixed trainable subspaces, which changes the effective update signal for current-task fitting and knowledge preservation.
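The projection view can be made concrete with a minimal sketch (not the authors' code): freezing part of the network is equivalent to masking the gradient, so the update is θ⁺ = θ − η P_S ∇J(θ), where the 0/1 mask acts as an orthogonal projector onto the trainable subset S. All numbers here are illustrative.

```python
import numpy as np

def projected_step(theta, grad, mask, lr=0.1):
    """One update restricted to the trainable subspace S.

    `mask` is a 0/1 vector selecting trainable coordinates; as a diagonal
    matrix it is an orthogonal projector (P @ P = P, P.T = P).
    """
    return theta - lr * mask * grad

theta = np.array([1.0, 1.0, 1.0, 1.0])
grad = np.array([0.5, 0.5, 0.5, 0.5])

shallow = np.array([0.0, 0.0, 0.0, 1.0])  # only the last "layer" trains
deep = np.array([1.0, 1.0, 1.0, 1.0])     # full fine-tuning

# Deeper regimes admit more of the gradient, so the update is larger.
d_shallow = np.linalg.norm(theta - projected_step(theta, grad, shallow))
d_deep = np.linalg.norm(theta - projected_step(theta, grad, deep))
assert d_deep > d_shallow
```

This is the mechanism the paper's core claim rests on: the projector changes the effective update signal seen by both the current-task loss and any preservation term.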
If this is right
- Relative rankings of methods such as online EWC, LwF, SI, and GEM are not consistent across the five trainable depth regimes.
- Deeper regimes produce larger update magnitudes and higher forgetting.
- The correlation between update magnitude and forgetting strengthens in deeper regimes.
- Comparative conclusions about continual learning methods depend on the chosen fine-tuning regime.
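Ranking consistency across regimes can be quantified with a rank correlation such as Kendall's tau between per-regime method orderings. A self-contained sketch follows; the method names are the paper's, but the accuracy numbers are invented for illustration, not taken from the paper.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items."""
    concordant = discordant = 0
    for x, y in combinations(list(rank_a), 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Illustrative per-regime accuracies (NOT the paper's numbers).
scores = {
    "head_only":  {"EWC": 0.81, "LwF": 0.79, "SI": 0.77, "GEM": 0.75},
    "full_depth": {"EWC": 0.62, "LwF": 0.70, "SI": 0.64, "GEM": 0.71},
}

def to_ranks(s):
    order = sorted(s, key=s.get, reverse=True)
    return {method: i for i, method in enumerate(order)}

tau = kendall_tau(to_ranks(scores["head_only"]), to_ranks(scores["full_depth"]))
print(round(tau, 2))  # → -0.67; tau < 1 means the ranking is not preserved
```

A regime-aware protocol of the kind the paper motivates would report such a statistic across all regime pairs rather than a single leaderboard.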
Where Pith is reading between the lines
- Standard CL benchmarks may need to treat trainable depth as an explicit experimental factor rather than a fixed default.
- Method design could benefit from explicit robustness testing across multiple adaptation depths.
- The observed link between update size and forgetting in deeper regimes suggests potential for new regularization strategies tuned to regime depth.
Load-bearing premise
That differences in method rankings across regimes are driven by the choice of trainable depth rather than other unstated implementation details.
What would settle it
Repeating the exact experimental setup across the same five regimes and finding that method rankings remain identical would falsify the central claim.
Original abstract
Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current task fitting and knowledge preservation operate. This analysis motivates the hypothesis that method comparisons need not be invariant across regimes. We test this hypothesis in task incremental CL, five trainable depth regimes, and four standard methods: online EWC, LwF, SI, and GEM. Across five benchmark datasets, namely MNIST, Fashion MNIST, KMNIST, QMNIST, and CIFAR-100, and across 11 task orders per dataset, we find that the relative ranking of methods is not consistently preserved across regimes. We further show that deeper adaptation regimes are associated with larger update magnitudes, higher forgetting, and a stronger relationship between the two. These results show that comparative conclusions in CL can depend strongly on the chosen fine-tuning regime, motivating regime-aware evaluation protocols that treat trainable depth as an explicit experimental factor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fine-tuning regimes in continual learning—defined by the trainable parameter subspace (specifically trainable depth)—constitute distinct problems. It formalizes regimes as projected optimization over fixed subspaces, which alters effective update signals for task fitting and knowledge retention. Empirically, across five datasets (MNIST, Fashion MNIST, KMNIST, QMNIST, CIFAR-100), 11 task orders per dataset, and five depth regimes, the relative rankings of online EWC, LwF, SI, and GEM are not preserved; deeper regimes also show larger update magnitudes, higher forgetting, and stronger correlation between them. This motivates regime-aware evaluation protocols.
Significance. If the central empirical pattern holds after controlling for confounds, the result is significant for the CL field: it challenges the implicit assumption that method comparisons are regime-invariant and shows that benchmark conclusions can depend on the chosen trainable subspace. The broad evaluation (five datasets, 11 task orders) provides a reasonably strong empirical basis, and the authors deserve credit for testing the hypothesis at scale rather than on a single benchmark. This could shift evaluation practices toward treating trainable depth as an explicit factor, affecting both method development and reproducibility.
major comments (2)
- Methods section: the central claim that ranking changes are driven by trainable depth regimes (rather than implementation artifacts) requires that hyperparameters (learning rates, regularization coefficients for the EWC penalty, LwF distillation loss, SI, and GEM) were either re-tuned per regime or explicitly held fixed with justification. The manuscript notes larger update magnitudes in deeper regimes; if hyperparameters were not adjusted, the observed ranking flips could arise from mismatched optimization strength rather than the projected subspace itself. An explicit statement or ablation on hyperparameter protocol is load-bearing for the hypothesis.
- §3 (formalization): while the projected-optimization view is a useful lens, the paper does not derive a quantitative prediction for how the projection operator changes the relative weighting of current-task loss versus preservation terms across methods. Without this or a controlled simulation isolating the projection effect, the link between the formalization and the empirical ranking changes remains interpretive rather than predictive.
minor comments (2)
- The five specific trainable depth regimes (e.g., which layers or parameter subsets are frozen) should be defined with a table or explicit list in the main text rather than deferred to the appendix, as this is central to reproducibility.
- Figure captions reporting method rankings should include the exact number of runs, error bars, and any statistical test used to support claims of 'not consistently preserved'.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our experimental design and theoretical framing. We address each major comment below and have made revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Methods section: the central claim that ranking changes are driven by trainable depth regimes (rather than implementation artifacts) requires that hyperparameters (learning rates, regularization coefficients for the EWC penalty, LwF distillation loss, SI, and GEM) were either re-tuned per regime or explicitly held fixed with justification. The manuscript notes larger update magnitudes in deeper regimes; if hyperparameters were not adjusted, the observed ranking flips could arise from mismatched optimization strength rather than the projected subspace itself. An explicit statement or ablation on hyperparameter protocol is load-bearing for the hypothesis.
Authors: We held all hyperparameters fixed across regimes precisely to isolate the effect of the trainable subspace projection. Re-tuning per regime would have introduced an additional confounding variable, making it impossible to attribute performance differences to the regime itself rather than to optimization strength. This protocol is now stated explicitly in the Methods section with justification, including a note that the larger update magnitudes observed in deeper regimes are a direct consequence of the projection (more parameters receive the full gradient signal) rather than a hyperparameter mismatch. We also added a brief discussion acknowledging that regime-specific tuning could be studied separately but would address a different question. revision: yes
- Referee: §3 (formalization): while the projected-optimization view is a useful lens, the paper does not derive a quantitative prediction for how the projection operator changes the relative weighting of current-task loss versus preservation terms across methods. Without this or a controlled simulation isolating the projection effect, the link between the formalization and the empirical ranking changes remains interpretive rather than predictive.
Authors: The formalization in §3 is intended as a mechanistic lens rather than a closed-form predictor; deriving a general quantitative mapping from projection to loss weighting would require strong assumptions on the Hessian and loss geometry that do not hold uniformly across methods and datasets. We have revised §3 to include an expanded discussion of the expected directional effects on current-task versus preservation gradients under projection, and we added a controlled toy simulation in the appendix that isolates the projection operator on a quadratic objective to illustrate the altered effective updates. The primary support for the hypothesis remains the large-scale empirical evaluation across five datasets and 11 orders, which directly tests whether rankings are preserved. revision: partial
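The appendix toy simulation the authors describe is not reproduced here, but its logic can be sketched on a two-task quadratic: run projected gradient descent on task B starting from task A's optimum, with forgetting measured as the rise in task A's loss. All constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lr, steps = 8, 0.1, 50

# Two quadratic "tasks": J_t(theta) = 0.5 * ||theta - mu_t||^2
mu_a, mu_b = rng.normal(size=dim), rng.normal(size=dim)

def run_regime(n_trainable):
    mask = np.zeros(dim)
    mask[:n_trainable] = 1.0          # deeper regime = larger subspace S
    theta = mu_a.copy()               # start at task A's optimum
    for _ in range(steps):
        theta = theta - lr * mask * (theta - mu_b)  # projected step on task B
    update_norm = np.linalg.norm(theta - mu_a)
    forgetting = 0.5 * np.sum((theta - mu_a) ** 2)  # rise in task-A loss
    return update_norm, forgetting

shallow = run_regime(2)
deep = run_regime(8)
# Deeper regime: larger total update and more forgetting.
assert deep[0] > shallow[0] and deep[1] > shallow[1]
```

On this toy objective, frozen coordinates stay at task A's optimum while trainable ones converge to task B's, so enlarging the subspace mechanically increases both update magnitude and forgetting, the directional effect the rebuttal appeals to.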
Circularity Check
No circularity: empirical hypothesis test on standard methods
full rationale
The paper defines adaptation regimes via projected optimization over trainable subspaces and analytically notes that depth changes alter update magnitudes and forgetting. It then states a hypothesis that method rankings need not be invariant and tests this directly via controlled experiments on five datasets, eleven task orders, and four standard CL methods (online EWC, LwF, SI, GEM). No step reduces a claimed prediction or uniqueness result to a fitted quantity defined inside the paper; no self-citation chain is invoked to justify the central claim; and the reported outcome (inconsistent rankings) is an observed empirical pattern rather than a quantity forced by the formalization itself. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Projected optimization over fixed trainable subspaces accurately models the effect of changing trainable depth.
Appendix A excerpt (additional theoretical details). For completeness, the paper restates the progress bound used in the main text. Progress bound under projected updates: let θ⁺ = θ − η P_S ∇J_t(θ) (8), where P_S is the orthogonal projector associated with the fixed trainable subset S. Assume that J_t is L-smooth on a neighborhood con…
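The appendix's bound is truncated in this excerpt, but the projected update (8) can be checked numerically against the standard descent inequality for an L-smooth objective, J(θ⁺) ≤ J(θ) − η‖P_S∇J(θ)‖² + (L/2)η²‖P_S∇J(θ)‖², which bounds of this form typically instantiate. The quadratic objective and constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, eta = 6, 0.2

# L-smooth quadratic J(theta) = 0.5 * theta @ A @ theta, with L = lambda_max(A)
M = rng.normal(size=(dim, dim))
A = M @ M.T + np.eye(dim)
L = np.linalg.eigvalsh(A).max()

mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # trainable subset S

def J(theta):
    return 0.5 * theta @ A @ theta

theta = rng.normal(size=dim)
grad = A @ theta
theta_plus = theta - eta * mask * grad            # eq. (8): projected update

# Descent inequality for L-smooth J along the projected direction:
proj_grad_sq = np.sum((mask * grad) ** 2)
lhs = J(theta_plus)
rhs = J(theta) - eta * proj_grad_sq + 0.5 * L * eta**2 * proj_grad_sq
assert lhs <= rhs + 1e-9
```

For a quadratic with Hessian A ⪯ L·I this inequality holds for any step size, since the second-order term is bounded exactly by (L/2)‖θ⁺ − θ‖².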