Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models
Pith reviewed 2026-06-28 21:57 UTC · model grok-4.3
The pith
Parameter alignment strategies reduce catastrophic forgetting during multilingual continual pretraining at low cost to new language gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Forgetting in multilingual expert language models arises from parameter drift during continual pretraining on new languages. Five layer-aware alignment strategies—hard layer freezing, soft regularization, post-hoc weight reversion, and model merging—counter this drift and substantially reduce forgetting while preserving most of the ability to acquire the target languages. On benchmarks spanning 32 training languages plus held-out ones, freezing and regularization best maintain reading comprehension and reasoning, whereas post-hoc reversion delivers the strongest translation performance.
What carries the argument
Layer-aware parameter alignment strategies that directly counteract parameter drift during or after family-based continual pretraining of multilingual expert models.
If this is right
- Layer freezing and regularization best preserve comprehension and reasoning after adding new languages.
- Post-hoc weight reversion produces the largest gains on translation tasks.
- These methods achieve reduced forgetting at minimal expense to acquisition of the target languages.
- Language-family organization alone does not prevent loss of general knowledge needed for downstream tasks.
- Deployment can pair specific alignment strategies to the performance axes that matter most for a given use case.
Where Pith is reading between the lines
- The same alignment logic could be tested on continual pretraining that adds entirely new domains rather than languages.
- If parameter drift is the main driver, similar lightweight alignment steps might apply to other continual-learning settings outside language models.
- The acquisition-forgetting frontier mapped here suggests a tunable trade-off surface that future work could optimize with fewer than five discrete strategies.
Load-bearing premise
Forgetting in multilingual continual pretraining is driven by parameter drift and the chosen benchmarks across four axes adequately reflect real-world downstream performance and retained knowledge.
What would settle it
An experiment applying the alignment strategies yet observing no reduction in forgetting rates relative to the unregularized baselines on the same perplexity, comprehension, reasoning, and translation benchmarks.
Figures
read the original abstract
While continual pretraining~(CPT) is a practical way to extend large language models to new languages, na\"ive finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition--forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that parameter alignment mitigates catastrophic forgetting during continual pretraining (CPT) of multilingual expert language models. It links forgetting to parameter drift, introduces five layer-aware alignment strategies (hard layer freezing, soft regularization, post-hoc weight reversion, model merging, and one additional), and evaluates them against unregularized CPT baselines on benchmarks spanning 32 languages from five families plus held-out languages. The evaluation covers four axes (perplexity, reading comprehension, physical reasoning, translation) and concludes that alignment substantially reduces forgetting at minimal cost to language acquisition, with layer freezing/regularization optimal for comprehension and post-hoc reversion for translation.
Significance. If the results hold, this work maps the acquisition-forgetting frontier for family-expert CPT and supplies practical deployment guidelines that pair strategies to tasks. The systematic empirical comparison across languages, families, and multiple axes is a strength of the study.
major comments (1)
- [Evaluation section] Evaluation section (and abstract): The central claim that parameter alignment substantially reduces forgetting rests on the premise that the four evaluation axes sufficiently capture retention of general knowledge for downstream tasks. The manuscript provides no additional evidence or ablation showing that perplexity, reading comprehension, physical reasoning, and translation across the 32+ languages are representative rather than task-specific; this assumption is load-bearing for the generality of the reported reductions in forgetting.
minor comments (1)
- [Abstract] Abstract: The text states there are 'five layer-aware parameter alignment strategies' but then enumerates only four (hard layer freezing, soft regularization, post-hoc weight reversion, and model merging). Clarify the fifth strategy or correct the count.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation design. We address the major comment below.
read point-by-point responses
-
Referee: [Evaluation section] Evaluation section (and abstract): The central claim that parameter alignment substantially reduces forgetting rests on the premise that the four evaluation axes sufficiently capture retention of general knowledge for downstream tasks. The manuscript provides no additional evidence or ablation showing that perplexity, reading comprehension, physical reasoning, and translation across the 32+ languages are representative rather than task-specific; this assumption is load-bearing for the generality of the reported reductions in forgetting.
Authors: We appreciate the referee's point that stronger justification is needed for why these four axes adequately represent retention of general knowledge. Our choice was driven by the need to probe distinct facets of capability retention in a multilingual setting: perplexity for core language modeling, reading comprehension for textual understanding, physical reasoning for factual and inferential knowledge, and translation for cross-lingual transfer. These axes follow standard practice in multilingual LLM evaluation and exhibit consistent patterns across our 32+ languages and five families, lending support to the generality of the forgetting-mitigation results. That said, the current manuscript does not include an explicit discussion or ablation of task representativeness. In revision we will add a concise subsection in the Evaluation section that (a) articulates the rationale for the chosen axes with references to prior multilingual benchmarks, (b) notes their coverage of different capability types, and (c) acknowledges the inherent limits of any finite task suite. This addresses the concern without requiring new experiments. revision: partial
Circularity Check
No circularity: empirical comparison of alignment strategies on benchmarks
full rationale
The paper is an empirical study that evaluates five parameter alignment strategies against CPT baselines on perplexity, reading comprehension, physical reasoning, and translation benchmarks across 32 languages. No equations, derivations, or mathematical claims are present that could reduce to fitted quantities or self-definitions by construction. No uniqueness theorems, ansatzes, or load-bearing self-citations are invoked to justify core premises; results are presented as direct experimental outcomes. The work is self-contained against external benchmarks and does not rename known results or smuggle assumptions via citation chains.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems , volume=
Madlad-400: A multilingual and document-level large audited dataset , author=. Advances in Neural Information Processing Systems , volume=
-
[2]
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLM s
Bandarkar, Lucas and Peng, Nanyun. The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLM s. Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025). 2025. doi:10.18653/v1/2025.mrl-main.10
-
[3]
2025 , eprint=
Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models , author=. 2025 , eprint=
2025
-
[4]
2024 , eprint=
Maintaining Plasticity in Continual Learning via Regenerative Regularization , author=. 2024 , eprint=
2024
-
[5]
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
Bandarkar, Lucas and Liang, Davis and Muller, Benjamin and Artetxe, Mikel and Shukla, Satya Narayan and Husa, Donald and Goyal, Naman and Krishnan, Abhinandan and Zettlemoyer, Luke and Khabsa, Madian. The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants. Proceedings of the 62nd Annual Meeting of the Association for Com...
-
[6]
2025 , eprint=
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures , author=. 2025 , eprint=
2025
-
[7]
2022 , eprint=
No Language Left Behind: Scaling Human-Centered Machine Translation , author=. 2022 , eprint=
2022
-
[8]
chr F : character n-gram F -score for automatic MT evaluation
Popovi \'c , Maja. chr F : character n-gram F -score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015. doi:10.18653/v1/W15-3049
-
[9]
doi:10.5281/zenodo.12608602 , url =
Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...
-
[10]
Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm \'a n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ...
-
[11]
Are All Languages Created Equal in Multilingual BERT ?
Wu, Shijie and Dredze, Mark. Are All Languages Created Equal in Multilingual BERT ?. Proceedings of the 5th Workshop on Representation Learning for NLP. 2020. doi:10.18653/v1/2020.repl4nlp-1.16
-
[12]
MEGAVERSE : Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Ahuja, Sanchit and Aggarwal, Divyanshu and Gumma, Varun and Watts, Ishaan and Sathe, Ashutosh and Ochieng, Millicent and Hada, Rishav and Jain, Prachi and Ahmed, Mohamed and Bali, Kalika and Sitaram, Sunayana. MEGAVERSE : Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks. Proceedings of the 2024 Conference of the North Amer...
-
[13]
BUFFET : Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Asai, Akari and Kudugunta, Sneha and Yu, Xinyan and Blevins, Terra and Gonen, Hila and Reid, Machel and Tsvetkov, Yulia and Ruder, Sebastian and Hajishirzi, Hannaneh. BUFFET : Benchmarking Large Language Models for Few-shot Cross-lingual Transfer. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguis...
-
[14]
Lifting the Curse of Multilinguality by Pre-training Modular Transformers
Pfeiffer, Jonas and Goyal, Naman and Lin, Xi and Li, Xian and Cross, James and Riedel, Sebastian and Artetxe, Mikel. Lifting the Curse of Multilinguality by Pre-training Modular Transformers. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1...
-
[15]
Blevins, Terra and Limisiewicz, Tomasz and Gururangan, Suchin and Li, Margaret and Gonen, Hila and Smith, Noah A. and Zettlemoyer, Luke. Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.604
-
[16]
2022 , eprint=
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models , author=. 2022 , eprint=
2022
-
[17]
2020 , eprint=
Beyond English-Centric Multilingual Machine Translation , author=. 2020 , eprint=
2020
-
[18]
2025 , eprint=
Gemma 3 Technical Report , author=. 2025 , eprint=
2025
-
[19]
Larger-Scale Transformers for Multilingual Masked Language Modeling
Goyal, Naman and Du, Jingfei and Ott, Myle and Anantharaman, Giri and Conneau, Alexis. Larger-Scale Transformers for Multilingual Masked Language Modeling. Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021). 2021. doi:10.18653/v1/2021.repl4nlp-1.4
-
[20]
Cross-lingual Language Model Pretraining , url =
CONNEAU, Alexis and Lample, Guillaume , booktitle =. Cross-lingual Language Model Pretraining , url =
-
[21]
XLM - E : Cross-lingual Language Model Pre-training via ELECTRA
Chi, Zewen and Huang, Shaohan and Dong, Li and Ma, Shuming and Zheng, Bo and Singhal, Saksham and Bajaj, Payal and Song, Xia and Mao, Xian-Ling and Huang, Heyan and Wei, Furu. XLM - E : Cross-lingual Language Model Pre-training via ELECTRA. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 202...
-
[22]
PARADISE : Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining
Reid, Machel and Artetxe, Mikel. PARADISE : Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.58
-
[23]
Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection
Ogunremi, Tolulope and Jurafsky, Dan and Manning, Christopher D. Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection. Findings of the Association for Computational Linguistics: EACL 2023. 2023. doi:10.18653/v1/2023.findings-eacl.93
-
[24]
2023 , eprint=
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model , author=. 2023 , eprint=
2023
-
[25]
URLhttp://www.sciencedirect.com/science/article/pii/ S0079742108605368
Michael McCloskey and Neal J. Cohen , abstract =. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem , editor =. 1989 , issn =. doi:https://doi.org/10.1016/S0079-7421(08)60536-8 , url =
-
[26]
Proceedings of the 35th International Conference on Machine Learning (ICML) , pages=
Explicit Inductive Bias for Transfer Learning with Convolutional Networks , author=. Proceedings of the 35th International Conference on Machine Learning (ICML) , pages=. 2018 , organization=
2018
-
[27]
Proceedings of the national academy of sciences , volume=
Overcoming catastrophic forgetting in neural networks , author=. Proceedings of the national academy of sciences , volume=. 2017 , publisher=
2017
-
[28]
2025 , eprint=
Continually Adding New Languages to Multilingual Language Models , author=. 2025 , eprint=
2025
-
[29]
What Causes Knowledge Loss in Multilingual Language Models?
Khelli, Maria and Cahyawijaya, Samuel and Purwarianti, Ayu and Winata, Genta Indra. What Causes Knowledge Loss in Multilingual Language Models?. Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics. 2025
2025
-
[30]
arXiv preprint arXiv:2510.19546 , year=
Conditions for Catastrophic Forgetting in Multilingual Translation , author=. arXiv preprint arXiv:2510.19546 , year=
-
[31]
arXiv preprint arXiv:2401.04088 , year=
Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=
-
[32]
2017 , eprint=
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author=. 2017 , eprint=
2017
-
[33]
arXiv preprint arXiv:2303.14177 , year=
Scaling expert language models with unsupervised domain discovery , author=. arXiv preprint arXiv:2303.14177 , year=
-
[34]
Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation
Chronopoulou, Alexandra and Stojanovski, Dario and Fraser, Alexander. Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation. Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023). 2023. doi:10.18653/v1/2023.loresmt-1.5
-
[35]
Proceedings of the 39th International Conference on Machine Learning , pages =
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =
2022
-
[36]
2023 , eprint=
Editing Models with Task Arithmetic , author=. 2023 , eprint=
2023
-
[37]
Qwen3-Max: Just Scale it , author =
-
[38]
2025 , eprint=
Babel: Open Multilingual Large Language Models Serving Over 90\ author=. 2025 , eprint=
2025
-
[39]
2024 , eprint=
Sailor: Open Language Models for South-East Asia , author=. 2024 , eprint=
2024
-
[40]
Do Llamas Work in E nglish? On the Latent Language of Multilingual Transformers
Wendler, Chris and Veselovsky, Veniamin and Monea, Giovanni and West, Robert. Do Llamas Work in E nglish? On the Latent Language of Multilingual Transformers. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.820
-
[41]
Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal
Huang, Jianheng and Cui, Leyang and Wang, Ante and Yang, Chengyi and Liao, Xinting and Song, Linfeng and Yao, Junfeng and Su, Jinsong. Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.186...
-
[42]
Overcoming Catastrophic Forgetting During Domain Adaptation of Seq2seq Language Generation
Li, Dingcheng and Chen, Zheng and Cho, Eunah and Hao, Jie and Liu, Xiaohu and Xing, Fan and Guo, Chenlei and Liu, Yang. Overcoming Catastrophic Forgetting During Domain Adaptation of Seq2seq Language Generation. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2...
-
[43]
Downey, C. M. and Blevins, Terra and Serai, Dhwani and Parikh, Dwija and Steinert-Threlkeld, Shane. Targeted Multilingual Adaptation for Low-resource Language Families. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.918
-
[44]
Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy. Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages. Proceedings of the 1st Workshop on Multilingual Representation Learning. 2021. doi:10.18653/v1/2021.mrl-1.11
-
[45]
2024 , eprint=
RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining , author=. 2024 , eprint=
2024
-
[46]
2024 , eprint=
Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities , author=. 2024 , eprint=
2024
-
[47]
2025 , note =
Elaine Zosa and Jouni Luoma and Kai Hakala and Antti Virtanen and Mika Koistinen and Jonathan Burdge , title =. 2025 , note =
2025
-
[48]
support.In:PracticeandExperienceinAdvancedResearchComputing2023:Com- puting for the Common Good
Boerner, Timothy J. and Deems, Stephen and Furlani, Thomas R. and Knuth, Shelley L. and Towns, John , title =. Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good , pages =. 2023 , isbn =. doi:10.1145/3569951.3597559 , abstract =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.