Towards Robust Evaluations of Continual Learning

Sebastian Farquhar, Yarin Gal

classification: stat.ML, cs.LG

keywords: learning, continual, evaluations, experiment, research, approaches, desiderata, designs

Experiments used in current continual learning research do not faithfully assess fundamental challenges of learning continually. Instead of assessing performance on challenging and representative experiment designs, recent research has focused on increased dataset difficulty, while still using flawed experiment set-ups. We examine standard evaluations and show why these evaluations make some continual learning approaches look better than they are. We introduce desiderata for continual learning evaluations and explain why their absence creates misleading comparisons. Based on our desiderata we then propose new experiment designs which we demonstrate with various continual learning approaches and datasets. Our analysis calls for a reprioritization of research effort by the community.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
cs.LG 2026-04 conditional novelty 6.0

Different valid temporal partitions of the same streaming dataset can produce materially different rankings and performance numbers for continual learning methods.

Fine-Tuning Regimes Define Distinct Continual Learning Problems
cs.LG 2026-04 unverdicted novelty 6.0

The relative rankings of continual learning methods are not preserved across different fine-tuning regimes defined by trainable parameter depth.