Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates
Pith reviewed 2026-05-20 07:10 UTC · model grok-4.3
The pith
Reducing learning rates on high-loss batches during fine-tuning reduces catastrophic forgetting by 93% on average while maintaining task performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identify a simple mechanism for controlling forgetting: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning.
What carries the argument
FINCH, the loss-adaptive learning-rate schedule that lowers the learning rate for batches with high current training loss to limit the per-step forgetting bound.
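As a rough illustration of such a schedule, the sketch below uses the rule η_i = κ / √L_{B_i}(θ_i) that is quoted elsewhere on this page. The function names, the `lr_max` cap, and the `eps` floor are assumptions added for this example; they are not taken from the paper, and this is not the authors' implementation.

```python
import math


def loss_adaptive_lr(batch_loss: float, kappa: float, lr_max: float, eps: float = 1e-8) -> float:
    """Return kappa / sqrt(batch_loss), capped at lr_max.

    The cap (lr_max) and the floor (eps) are illustrative safeguards so the
    rate stays finite as the loss approaches zero; they are assumptions.
    """
    return min(lr_max, kappa / math.sqrt(max(batch_loss, eps)))


def sgd_step(theta, grad, batch_loss, kappa=1e-3, lr_max=1e-2):
    """One plain SGD step whose learning rate depends on the current batch loss."""
    lr = loss_adaptive_lr(batch_loss, kappa, lr_max)
    new_theta = [t - lr * g for t, g in zip(theta, grad)]
    return new_theta, lr


# A high-loss batch gets a smaller rate than a low-loss batch.
lr_high_loss = loss_adaptive_lr(4.0, kappa=1e-3, lr_max=1e-2)
lr_low_loss = loss_adaptive_lr(0.25, kappa=1e-3, lr_max=1e-2)
```

While the cap is inactive, the product of the learning rate and √loss stays equal to `kappa`, which is the quantity the claimed per-step bound depends on.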
If this is right
- Across knowledge acquisition, science, and low-resource language adaptation benchmarks, forgetting is reduced by 93% on average while task performance matches standard fine-tuning.
- On Qwen3-4B knowledge acquisition, TruthfulQA degradation is cut by 5x and HaluEval degradation is reversed.
- Confidence calibration is better preserved compared to standard fine-tuning.
- Learning-rate schedules can shape model behavior during fine-tuning beyond just target-task optimization.
Where Pith is reading between the lines
- The bound could motivate real-time loss monitoring to dynamically adjust rates in any sequential training setting where earlier capabilities must be retained.
- Similar adaptive schedules might stabilize training from scratch by down-weighting early high-loss phases to avoid unstable updates.
- If the mechanism generalizes, it could reduce reliance on explicit regularization terms in continual learning pipelines.
Load-bearing premise
Per-step forgetting during fine-tuning is bounded by the product of the learning rate and the square root of the current training loss.
What would settle it
An experiment showing that observed forgetting after a fine-tuning step on a high-loss batch exceeds the predicted bound of learning rate times square root of loss, or that lowering the rate on such batches fails to reduce overall forgetting compared to a fixed schedule.
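A check of this kind could be scripted roughly as follows. This is a minimal sketch: `StepRecord`, `bound_violations`, and the proportionality `constant` are hypothetical names introduced here, and the page does not give the constant in the bound. Forgetting is taken to be the measured increase in pretraining loss after a step, as in the circularity check on this page.

```python
import math
from typing import List, NamedTuple


class StepRecord(NamedTuple):
    lr: float          # learning rate used for the step
    train_loss: float  # training loss on the batch before the step
    forgetting: float  # measured increase in pretraining loss after the step


def bound_violations(records: List[StepRecord], constant: float) -> List[int]:
    """Return indices of steps whose forgetting exceeds constant * lr * sqrt(loss).

    `constant` stands in for the unspecified factor in the bound (assumption).
    """
    return [
        i
        for i, r in enumerate(records)
        if r.forgetting > constant * r.lr * math.sqrt(r.train_loss)
    ]


# Two hypothetical steps with the same predicted bound of 2e-3 (constant = 1.0).
records = [
    StepRecord(lr=1e-3, train_loss=4.0, forgetting=1e-3),  # within the bound
    StepRecord(lr=1e-3, train_loss=4.0, forgetting=5e-3),  # exceeds the bound
]
violations = bound_violations(records, constant=1.0)
```

Steps that are flagged even for generous values of `constant` would count as evidence against the premise.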
Original abstract
Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FINCH, a loss-adaptive learning-rate schedule for fine-tuning LLMs that reduces the learning rate on high-loss batches. It claims that per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss, motivating the schedule while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH is reported to reduce forgetting by 93% on average while matching standard fine-tuning task performance, with specific gains on Qwen3-4B including a 5x cut in TruthfulQA degradation and reversal of HaluEval degradation.
Significance. If the per-step forgetting bound is rigorously derived and the empirical controls are sound, the work shows that learning-rate schedules alone can shape fine-tuning behavior to preserve pretraining capabilities without suppressing hard tokens or modifying the loss. This is a lightweight alternative to existing forgetting-mitigation techniques and could be broadly useful for stable LLM adaptation.
major comments (1)
- [Abstract] The claim that per-step forgetting is bounded by LR × √(current training loss) is presented as the key mechanism motivating FINCH, yet no derivation, inequality, or set of assumptions is supplied. Without this, it is impossible to assess whether the bound remains valid under the distribution shift that occurs during fine-tuning on out-of-distribution data, which directly undermines the justification for the adaptive schedule and the reported 93% forgetting reduction.
minor comments (2)
- The abstract refers to 'knowledge acquisition, science, and low-resource language adaptation benchmarks' but does not list the concrete datasets, number of runs, or statistical tests used to support the average 93% reduction.
- It would be helpful to include a short proof sketch or inequality chain for the claimed forgetting bound in the main text or appendix so that readers can verify the √loss dependence.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for identifying an important point regarding the theoretical motivation. We address the major comment below and will revise the paper accordingly.
- Referee: [Abstract] The claim that per-step forgetting is bounded by LR × √(current training loss) is presented as the key mechanism motivating FINCH, yet no derivation, inequality, or set of assumptions is supplied. Without this, it is impossible to assess whether the bound remains valid under the distribution shift that occurs during fine-tuning on out-of-distribution data, which directly undermines the justification for the adaptive schedule and the reported 93% forgetting reduction.
  Authors: We acknowledge that the abstract states the bound without supplying the derivation or explicit assumptions, which limits the ability to evaluate its validity under distribution shift. The manuscript motivates the bound via the observation that the per-step parameter update magnitude scales with the learning rate and that the gradient norm is controlled by the current loss value (via standard inequalities relating loss to gradient under smoothness assumptions). However, a complete step-by-step derivation with listed assumptions was not included. We will add a dedicated paragraph (or short subsection) in the revised manuscript that states the bound formally, lists the assumptions (e.g., L-smoothness of the loss and bounded gradient norms), and discusses its role as a heuristic motivation rather than a strict guarantee throughout training. We will also note that the schedule remains beneficial even if the bound loosens under shift, because it still down-weights updates on high-loss batches. This revision will strengthen the justification while preserving all empirical claims.
  Revision: yes
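The kind of inequality chain the rebuttal alludes to can be sketched as follows. This is an illustration under assumptions chosen here, not the authors' derivation, which is not reproduced on this page: the symbols β (smoothness constant of the batch loss) and G (Lipschitz constant of the pretraining loss L_old) are introduced for this sketch, and a plain SGD update is assumed.

```latex
% Assumptions (illustrative, not taken from the paper):
%   (A1) L_B >= 0 is \beta-smooth in \theta
%   (A2) L_old is G-Lipschitz in \theta
%   (A3) plain SGD step: \theta_{i+1} = \theta_i - \eta_i \nabla L_{B_i}(\theta_i)
\begin{align*}
\|\nabla L_{B_i}(\theta_i)\|^2
  &\le 2\beta\, L_{B_i}(\theta_i)
  && \text{(A1: smooth, non-negative loss)} \\
L_{\mathrm{old}}(\theta_{i+1}) - L_{\mathrm{old}}(\theta_i)
  &\le G\,\|\theta_{i+1} - \theta_i\|
  && \text{(A2)} \\
  &= G\,\eta_i\,\|\nabla L_{B_i}(\theta_i)\|
  && \text{(A3)} \\
  &\le G\sqrt{2\beta}\;\eta_i \sqrt{L_{B_i}(\theta_i)} .
\end{align*}
```

Under these assumptions the per-step increase in L_old scales with η_i · √L_{B_i}(θ_i), which matches the form of the claimed bound. The referee's concern corresponds to whether constants like G and β stay controlled under distribution shift.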
Circularity Check
No significant circularity; bound and schedule are independently motivated
The paper derives the per-step forgetting bound from the parameter update rule combined with a definition of forgetting (increase in pretraining loss) and presents it as a first-principles mechanism. The FINCH schedule is then constructed directly from that bound without fitting parameters to the target forgetting metric or renaming an observed pattern. No self-citation chains, self-definitional steps, or fitted-input-called-prediction reductions appear in the derivation. Empirical results on benchmarks provide independent falsifiable content outside the motivating inequality. The central claim therefore remains self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss... η_i = κ / √L_{B_i}(θ_i)"
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Corollary 1... cumulative forgetting satisfies L_old(p_T) − L_old(p_0) = O(T κ)"
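The O(T κ) rate in the quoted corollary is what one would expect if each step's forgetting is at most a constant times η_i · √L_{B_i}(θ_i) and the schedule sets η_i = κ / √L_{B_i}(θ_i): every step then contributes at most the constant times κ, whatever the loss trajectory. A minimal numeric sketch of that accounting follows; `cumulative_bound` and `constant` are hypothetical names, and the constant's value is not given on this page.

```python
import math


def cumulative_bound(losses, kappa, constant=1.0):
    """Sum the per-step bounds constant * lr_i * sqrt(loss_i) with lr_i = kappa / sqrt(loss_i).

    `constant` is a stand-in for the unspecified factor in the bound (assumption).
    """
    total = 0.0
    for loss in losses:
        lr = kappa / math.sqrt(loss)              # loss-adaptive learning rate
        total += constant * lr * math.sqrt(loss)  # per-step bound equals constant * kappa
    return total


# Three steps with very different losses still sum to 3 * kappa.
total = cumulative_bound([4.0, 1.0, 0.25], kappa=1e-3)
```

With a fixed learning rate the same sum would instead grow with the losses encountered, so high-loss batches would dominate the total.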
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.