Recognition: no theorem link
Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning
Pith reviewed 2026-05-14 21:16 UTC · model grok-4.3
The pith
SLICE initializes LoRA adapters by reconciling current-task and replay gradients through a projection operator and factoring the result with truncated SVD, reducing catastrophic forgetting in continual learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SLICE accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and decomposes the result via truncated SVD to initialize the adapter weights. On the TRACE benchmark and Super-NI sequences, this yields better Average Performance, Final Performance, and Forgetting than vanilla LoRA, LoRA-GA, or LoRAM, while preserving General Performance and In Context Performance.
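The abstract leaves these metrics undefined. Under conventions common in continual-learning evaluation (an assumption; TRACE and the paper may use variants), with a_{t,i} the score on task i after training through task t and T tasks in total:

```latex
% Typical continual-learning metrics (assumed conventions, not necessarily
% the paper's verbatim definitions).
\[
\mathrm{FinalPerf} \,=\, \frac{1}{T} \sum_{i=1}^{T} a_{T,i},
\qquad
\mathrm{Forgetting} \,=\, \frac{1}{T-1} \sum_{i=1}^{T-1}
  \Bigl( \max_{t \in \{i, \dots, T-1\}} a_{t,i} - a_{T,i} \Bigr).
\]
```

General Performance and In Context Performance are typically measured on held-out general-ability and in-context-learning probes, tracking whether adaptation degrades the base model rather than performance on the task sequence itself.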
What carries the argument
The projection operator applied to accumulated current-task and replay gradients, followed by truncated SVD, produces initial LoRA weights that channel updates away from directions that would overwrite prior tasks.
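The abstract does not pin down the projection operator or how the SVD factors are split between the adapter matrices. Below is a minimal sketch, assuming a PCGrad-style projection (Yu et al., "Gradient Surgery for Multi-Task Learning") and a symmetric square-root split of the singular values; the name `slice_init` and every detail beyond accumulate, project, decompose are assumptions, not the paper's implementation.

```python
import torch

def slice_init(grad_current: torch.Tensor,
               grad_replay: torch.Tensor,
               rank: int):
    """Sketch of a SLICE-style LoRA initialization (assumed details).

    grad_current: accumulated gradient of one weight matrix on the current task.
    grad_replay:  accumulated gradient of the same matrix on the replay buffer.
    Returns (A, B) with B @ A a rank-`rank` approximation of the reconciled
    gradient.
    """
    g_cur, g_rep = grad_current.flatten(), grad_replay.flatten()
    # PCGrad-style surgery (one plausible projection operator): if the
    # current-task gradient conflicts with the replay gradient, remove its
    # component along the replay direction.
    dot = torch.dot(g_cur, g_rep)
    if dot < 0:
        g_cur = g_cur - dot / (g_rep.norm() ** 2 + 1e-12) * g_rep
    g = g_cur.view_as(grad_current)

    # Truncated SVD of the reconciled gradient: keep the top-`rank` directions
    # and split the singular values symmetrically between the two factors.
    U, S, Vh = torch.linalg.svd(g, full_matrices=False)
    B = U[:, :rank] * S[:rank].sqrt()              # (d_out, rank)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (rank, d_in)
    return A, B
```

LoRA-GA, one of the baselines, additionally subtracts the initial product BA from the frozen weight so the model's output is unchanged at step zero; whether SLICE applies the same correction is not stated in the abstract.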
If this is right
- SLICE improves Average Performance, Final Performance, and Forgetting metrics relative to vanilla LoRA, LoRA-GA, and LoRAM.
- SLICE preserves General Performance and In Context Performance across task sequences.
- SLICE retains its stability-plasticity advantage on both standard continual learning sequences and adversarial sequences built from opposing-gradient task pairs.
- The method works with existing replay buffers and requires no changes to training dynamics after initialization (see the toy sketch after this list).
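If that drop-in property holds, it can be exercised end to end on a toy problem. The synthetic setup below is entirely an assumption for illustration; it reuses the `slice_init` sketch above, and everything after the initialization is an ordinary training loop.

```python
import torch

torch.manual_seed(0)
d_in, d_out, rank = 16, 16, 4
W0 = torch.randn(d_out, d_in) * 0.1        # frozen base weight
T_cur = torch.randn(d_in, d_out) * 0.05    # synthetic "current task" target
T_rep = torch.randn(d_in, d_out) * 0.05    # synthetic "replay" target

def weight_grad(target):
    """Gradient of a squared loss w.r.t. the base weight on one batch."""
    x = torch.randn(32, d_in)
    W = W0.clone().requires_grad_(True)
    ((x @ W.T - x @ target) ** 2).mean().backward()
    return W.grad

# Only the initialization is SLICE-specific.
A, B = slice_init(weight_grad(T_cur), weight_grad(T_rep), rank)
A = A.detach().requires_grad_(True)
B = B.detach().requires_grad_(True)

# From here on: vanilla LoRA-style fine-tuning, unchanged dynamics.
opt = torch.optim.SGD([A, B], lr=1e-2)
for _ in range(100):
    x = torch.randn(32, d_in)
    loss = ((x @ (W0 + B @ A).T - x @ T_cur) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```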
Where Pith is reading between the lines
- Resolving gradient conflicts explicitly at initialization time may prove more reliable than relying solely on training-time regularization for long task sequences.
- The same projection-plus-SVD pattern could be tested on other parameter-efficient modules such as prefix tuning or adapters without low-rank structure.
- Adversarial task sequences constructed by mining opposing gradients offer a reproducible stress test that future continual-learning methods could adopt as a benchmark (a minimal mining sketch follows this list).
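The abstract says only that the adversarial Super-NI sequences come from "mining task pairs with maximally opposing gradients." One plausible reading of that criterion, with the function name and the cosine-similarity measure as assumptions rather than the paper's procedure:

```python
import itertools
import torch

def mine_opposing_pairs(task_grads, k=5):
    """Rank task pairs by gradient conflict (assumed mining criterion).

    task_grads: dict mapping task name -> accumulated, flattened gradient.
    Returns the k pairs with the most negative cosine similarity, i.e.
    the most mutually opposing tasks.
    """
    scored = []
    for (a, ga), (b, gb) in itertools.combinations(task_grads.items(), 2):
        cos = torch.nn.functional.cosine_similarity(ga, gb, dim=0).item()
        scored.append((cos, a, b))
    scored.sort()  # most negative cosine similarity first
    return [(a, b, cos) for cos, a, b in scored[:k]]
```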
Load-bearing premise
The projection operator applied to accumulated current-task and replay gradients, followed by truncated SVD, reliably channels updates into subspaces that avoid overwriting previously learned directions without introducing new interference or instability.
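There is a one-line first-order argument for this premise, assuming the PCGrad-style projection used in the sketch above (the paper's actual operator may differ): when the current-task gradient g_c conflicts with the replay gradient g_r, the projected gradient is orthogonal to g_r, so a step along it leaves the replay loss unchanged to first order.

```latex
% First-order non-interference under a PCGrad-style projection (assumption).
\[
g_c' \,=\, g_c - \frac{\langle g_c, g_r \rangle}{\lVert g_r \rVert^2}\, g_r
\quad \text{when } \langle g_c, g_r \rangle < 0
\quad\Longrightarrow\quad
\langle g_c', g_r \rangle = 0 .
\]
```

Whether that guarantee survives the truncated-SVD step, which keeps only the top singular directions of the reconciled gradient, is precisely what the adversarial sequences stress-test.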
What would settle it
Finding that SLICE produces higher forgetting rates than vanilla LoRA on the adversarial Super-NI sequences mined from maximally opposing gradient pairs would disprove the central claim.
Original abstract
LoRA is widely adopted for continual fine-tuning of Large Language Models due to its parameter efficiency, modularity across tasks, and compatibility with replay strategies. However, LoRA-based continual learning remains vulnerable to catastrophic forgetting, whose severity depends on how successive task gradients interact: when consecutive task gradients conflict, standard adapter initializations channel updates into subspaces that overwrite previously learned directions. We propose SLICE, a gradient-surgery-based initialization for LoRA adapters in continual learning. SLICE accumulates gradients from both the current task and a replay buffer of prior tasks, reconciles them through a projection operator, and decomposes the result via truncated SVD to initialize the adapter weights. We evaluate SLICE on the TRACE benchmark and sequences of Super-NI tasks, including a set of adversarial Super-NI sequences that we construct by mining task pairs with maximally opposing gradients. Compared to vanilla LoRA, LoRA-GA, and LoRAM, SLICE consistently achieves a better stability-plasticity trade-off, improving Average Performance, Final Performance and Forgetting metrics while preserving General Performance and In Context Performance across both standard and adversarial continual learning sequences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SLICE, a gradient-surgery-based initialization for LoRA adapters in continual learning of LLMs. It accumulates gradients from the current task and a replay buffer, reconciles them via a projection operator, and applies truncated SVD to initialize the adapter weights. Evaluations on the TRACE benchmark and Super-NI task sequences (including adversarially constructed pairs with opposing gradients) claim that SLICE outperforms vanilla LoRA, LoRA-GA, and LoRAM on Average Performance, Final Performance, and Forgetting while preserving General Performance and In Context Performance.
Significance. If the reported gains hold, the work offers a lightweight, parameter-free initialization strategy that directly targets gradient conflicts to improve the stability-plasticity trade-off in LoRA-based continual learning. The inclusion of adversarially mined sequences provides a stronger test of robustness than standard benchmarks alone. The method builds on established operations (gradient accumulation, projection, and truncated SVD) without introducing new learned components.
Minor comments (2)
- [Abstract] The high-level claim of consistent metric improvements would be strengthened by at least one concrete numerical result (e.g., absolute or relative gains in Average Performance) or a pointer to a results table.
- [Method] The description of the projection operator and its interaction with the replay-buffer gradients should include explicit pseudocode or a small worked example to ensure reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. The report highlights the lightweight nature of SLICE and the value of the adversarial task sequences, which aligns with our goals. Beyond the two minor comments, no major concerns were raised, so no points require rebuttal.
Circularity Check
No significant circularity; algorithmic procedure is self-contained
Full rationale
The paper presents SLICE as a direct algorithmic construction: accumulate current-task and replay gradients, apply a projection operator to reconcile them, then use truncated SVD to obtain the low-rank initialization. No equation or step defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a load-bearing self-citation whose content reduces to the present claim. The stability-plasticity improvements are reported as empirical outcomes on TRACE and Super-NI (including adversarial pairs), not as logical consequences of the method's own definition. The derivation chain therefore does not presuppose its own conclusion.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: gradient conflicts between consecutive tasks cause overwriting of previously learned directions in LoRA subspaces.