An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

Aaron Courville, Da Xiao, Ian J. Goodfellow, Mehdi Mirza, Yoshua Bengio

classification stat.ML cs.LG cs.NE

keywords task catastrophic forgetting activation best networks neural problem

Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm: it is consistently best at adapting to the new task and at remembering the old task, and it has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.
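
The two-task protocol described in the abstract (train on a first task, continue training on a second, then re-measure error on the first) can be illustrated with a toy sketch. This is not the paper's experimental setup: it assumes a one-weight linear model trained by plain gradient descent instead of a neural network with dropout, and the tasks `task_a` and `task_b`, the learning rate, and the step count are all hypothetical choices made so the effect is easy to see.

```python
# Toy sketch only: a one-weight linear model, not the paper's networks or dropout.

def mse(w, xs, ys):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, xs, ys, lr=0.05, steps=200):
    """Plain gradient descent on the MSE, starting from weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0]
task_a = [2.0 * x for x in xs]    # hypothetical first task: y = 2x
task_b = [-2.0 * x for x in xs]   # hypothetical, conflicting second task: y = -2x

w_after_a = train(0.0, xs, task_a)
w_after_b = train(w_after_a, xs, task_b)   # continue from the task-A solution

old_task_error_before = mse(w_after_a, xs, task_a)
old_task_error_after = mse(w_after_b, xs, task_a)
new_task_error = mse(w_after_b, xs, task_b)

print(f"task A error after training on A: {old_task_error_before:.4f}")
print(f"task A error after training on B: {old_task_error_after:.4f}")
print(f"task B error after training on B: {new_task_error:.4f}")
```

Because the two toy tasks directly conflict and the model has a single weight, fitting the second task necessarily overwrites the first. The old-task and new-task errors printed at the end are the two quantities whose tradeoff the abstract refers to.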

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound
cs.LG 2026-05 unverdicted novelty 7.0

MIST fixes unreliable splits in streaming decision trees for class-incremental learning by using a K-independent McDiarmid bound on Gini impurity, Bayesian moment projection for knowledge transfer, and KLL quantile sk...

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
cs.CL 2026-05 unverdicted novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG 2026-05 unverdicted novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

NORACL: Neurogenesis for Oracle-free Resource-Adaptive Continual Learning
cs.LG 2026-04 unverdicted novelty 6.0

NORACL dynamically grows network capacity via neurogenesis-inspired signals to achieve oracle-level continual learning performance without pre-specifying architecture size.

Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks
cs.LG 2026-04 unverdicted novelty 6.0

FTN achieves near-zero forgetting on continual learning benchmarks by isolating task subnetworks via self-organizing binary masks generated through gradient descent, smoothing, and k-winner-take-all.

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
cs.LG 2026-04 conditional novelty 6.0

Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.

Continuous Limits of Coupled Flows in Representation Learning
cs.LG 2026-04 unverdicted novelty 6.0

Discrete decentralized learning dynamics on manifolds converge uniformly to an overdamped Langevin SDE whose stationary states produce orthogonally disentangled, linearly separable features.

Label Leakage Attacks in Machine Unlearning: A Parameter and Inversion-Based Approach
cs.CR 2026-04 unverdicted novelty 6.0

Parameter-difference and model-inversion attacks can identify forgotten classes after machine unlearning on standard image datasets.

Debiasing LLMs by Fine-tuning
q-fin.GN 2026-04 unverdicted novelty 6.0

Supervised fine-tuning with LoRA on rational benchmark forecasts corrects extrapolation bias out-of-sample in LLM predictions for controlled experiments and cross-sectional stock returns.

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
cs.AI 2023-08 unverdicted novelty 6.0

MetaGPT embeds human SOPs into LLM prompts to create role-specialized agent teams that produce more coherent solutions on collaborative software engineering tasks than prior chat-based multi-agent systems.

Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
cs.LG 2026-05 unverdicted novelty 5.0

Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

Online Generalised Predictive Coding
stat.ML 2026-05 unverdicted novelty 5.0

Online generalised predictive coding (ODEM) tracks latent states in nonlinear and chaotic generative models by separating temporal scales for fast Bayesian belief updating and slow parameter learning.

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0

Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

(How) Learning Rates Regulate Catastrophic Overtraining
cs.LG 2026-04 unverdicted novelty 5.0

Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.

Dynamic Distillation and Gradient Consistency for Robust Long-Tailed Incremental Learning
cs.CV 2026-05 unverdicted novelty 4.0

Gradient consistency regularization and entropy-driven dynamic distillation improve accuracy by up to 5% in long-tailed incremental learning, with strong gains in majority-to-minority task ordering.

MPCS: Neuroplastic Continual Learning via Multi-Component Plasticity and Topology-Aware EWC
cs.LG 2026-05 unverdicted novelty 4.0

MPCS integrates eleven plasticity mechanisms and reaches a Normalized Efficiency Score of 94.2 on a 31-task benchmark, with ablations showing that removing EWC and Hebbian updates yields higher performance at lower cost.