Recognition: unknown
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
Pith reviewed 2026-05-12 04:42 UTC · model grok-4.3
The pith
Analysis of a single-layer Transformer shows that data replay stabilizes factual knowledge by shifting convergence dynamics, whereas regularization only slows parameter updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. We propose STOC to identify influential factual snippets for replay data generation based on attention contribution.
What carries the argument
Theoretical characterization of cFKA training dynamics via a single-layer Transformer model, distinguishing rate adjustments from dynamic shifts.
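To make the rate-versus-dynamics distinction concrete, here is a hedged toy sketch, not the paper's derivation: a linear associative memory stands in for factual recall, a smaller learning rate stands in for rate-only modulation, and mixed batches stand in for data replay.

```python
import numpy as np

# Hedged toy, not the paper's model: a linear map W from subject embeddings to
# object embeddings stands in for factual recall. We "pretrain" on fact A, then
# continue training on an overlapping fact B under three regimes and measure
# how well each fact is recalled at the end.

rng = np.random.default_rng(0)
d = 32
s_a, o_a = rng.normal(size=d), rng.normal(size=d)    # pretrained fact A
s_b = 0.8 * s_a + 0.6 * rng.normal(size=d)           # new fact B with an overlapping subject
o_b = rng.normal(size=d)

def train(W, facts, lr, steps):
    """Plain SGD on 0.5 * ||W s - o||^2, cycling over the given facts."""
    for _ in range(steps):
        for s, o in facts:
            W = W - lr * np.outer(W @ s - o, s)
    return W

def recall_error(W, s, o):
    return np.linalg.norm(W @ s - o) / np.linalg.norm(o)

W0 = train(np.zeros((d, d)), [(s_a, o_a)], lr=0.01, steps=3000)   # pretraining on fact A

runs = {
    "plain CPT on B":         train(W0.copy(), [(s_b, o_b)], lr=0.01,  steps=3000),
    "rate-only (smaller lr)": train(W0.copy(), [(s_b, o_b)], lr=0.001, steps=3000),
    "replay of A with B":     train(W0.copy(), [(s_b, o_b), (s_a, o_a)], lr=0.01, steps=3000),
}
for name, W in runs.items():
    print(f"{name:24s} fact A error {recall_error(W, s_a, o_a):.2f}  "
          f"fact B error {recall_error(W, s_b, o_b):.2f}")
```

In this toy, the plain and rate-only runs converge to the same solution that overwrites fact A (the smaller learning rate only reaches it later), while the replay run converges to a point that stores both facts; the paper's claim is that the single-layer Transformer dynamics behave analogously.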
If this is right
- Regularization-based methods cannot prevent catastrophic forgetting in continual factual knowledge acquisition regardless of implementation details.
- Data replay methods can stabilize knowledge by appropriately shifting convergence points in the parameter space.
- The STOC method provides a practical way to generate effective replay data by focusing on attention-contributing tokens (a minimal sketch follows this list).
- A unified understanding of training dynamics allows CPT techniques to be improved systematically, according to how each method alters convergence behavior.
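The sketch below is a rough illustration of attention-contribution-based token selection, not the paper's STOC scoring rule: it assumes a token's contribution is simply the attention mass it receives, averaged over heads, and keeps the top-scoring tokens as candidate factual snippets to seed replay generation.

```python
import numpy as np

# Hedged sketch of attention-contribution token selection. The exact STOC
# score is not reproduced here; this toy treats a token's "contribution" as
# the total attention mass it receives, averaged over heads.

def attention_contribution(attn, top_k=3):
    """attn: array of shape (heads, seq, seq), rows sum to 1 (query -> key).
    Returns indices of the top_k tokens by received attention mass."""
    received = attn.mean(axis=0).sum(axis=0)   # sum over queries, mean over heads
    return np.argsort(received)[::-1][:top_k]

# Toy example: 6 tokens, 2 heads, random attention matrices.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 6, 6))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row-wise softmax

tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
idx = attention_contribution(attn, top_k=3)
print("selected snippet tokens:", [tokens[i] for i in idx])
```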
Where Pith is reading between the lines
- Similar dynamic analysis could be applied to continual learning in other domains like reinforcement learning or vision to predict method effectiveness.
- If the single-layer approximation generalizes, it may enable deriving forgetting bounds without simulating full models.
- Attention-based token selection might be combined with other generation strategies to further optimize replay efficiency.
- Future work could test whether multi-layer Transformers follow the same convergence patterns identified here.
Load-bearing premise
The training dynamics of continual factual knowledge acquisition in full-scale language models can be accurately characterized and explained using a single-layer Transformer model.
What would settle it
Run experiments on a multi-layer language model to check whether regularization only affects convergence speed or also changes the forgetting rate of pretrained facts beyond what the single-layer analysis predicts.
Original abstract
Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed as continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a theoretical framework for continual factual knowledge acquisition (cFKA) during continual pre-training of language models, using a single-layer Transformer to derive closed-form training dynamics. It claims that regularization-based CPT methods only modulate parameter convergence rates without changing the inherent forgetting tendency, whereas data replay methods shift the convergence dynamics to stabilize pretrained knowledge. Building on this, the authors introduce the STOC algorithm, which generates replay data by selecting influential factual tokens via attention contribution scores, and validate the framework and method through experiments on synthetic and real-world datasets.
Significance. If the single-layer analysis holds and generalizes, the work provides a principled, unified explanation for the differential effectiveness of regularization versus replay in mitigating catastrophic forgetting, which could inform more effective CPT algorithms. Strengths include the closed-form derivation of dynamics and the theory-guided STOC proposal; these are concrete contributions that go beyond empirical observation.
Major comments (2)
- [§3] §3 (Theoretical Framework): The load-bearing claim that regularization adjusts only convergence rate while replay shifts dynamics to eliminate forgetting is derived entirely from the single-layer Transformer loss landscape and gradient flow. This does not address whether multi-layer residual connections, layer-wise specialization of factual knowledge, or emergent attention patterns introduce additional forgetting modes that would make regularization affect stability differently.
- [§4] §4 (Experimental Validation): The paper reports improved cFKA performance with STOC on synthetic and real datasets, but provides no direct measurement or visualization of parameter trajectories or convergence rates in the full-scale models to confirm that the observed gains arise from the predicted dynamic shift rather than other factors such as data quality.
Minor comments (2)
- [§3.3] Notation for attention contribution scores and the exact form of the generative replay objective could be made more explicit in the method description to aid reproducibility.
- [Abstract] The abstract and introduction would benefit from a brief statement of the precise assumptions under which the single-layer analysis is expected to apply to deeper models.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment point by point below, providing our honest assessment and indicating planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Theoretical Framework): The load-bearing claim that regularization adjusts only convergence rate while replay shifts dynamics to eliminate forgetting is derived entirely from the single-layer Transformer loss landscape and gradient flow. This does not address whether multi-layer residual connections, layer-wise specialization of factual knowledge, or emergent attention patterns introduce additional forgetting modes that would make regularization affect stability differently.
Authors: We acknowledge that the closed-form analysis relies on a single-layer Transformer to make the loss landscape and gradient flow tractable. This choice enables precise characterization of how regularization modulates convergence speed without altering the forgetting fixed point, while replay modifies the effective dynamics. We agree that residual connections, layer specialization, and emergent patterns in deeper models could introduce additional forgetting modes not captured here. Our real-world experiments on multi-layer LMs show performance patterns consistent with the theory, but we cannot claim full generalization without further analysis. In revision, we will add an explicit limitations subsection discussing these assumptions and outlining directions for extending the framework to multi-layer cases. revision: partial
- Referee: [§4] §4 (Experimental Validation): The paper reports improved cFKA performance with STOC on synthetic and real datasets, but provides no direct measurement or visualization of parameter trajectories or convergence rates in the full-scale models to confirm that the observed gains arise from the predicted dynamic shift rather than other factors such as data quality.
Authors: We agree that direct trajectory measurements in full-scale models would more convincingly link gains to the predicted dynamic shift. The synthetic experiments already include such visualizations and closed-form comparisons. For real-world models, full parameter tracking is infeasible due to scale, but we will add targeted measurements in revision: tracking parameter norm changes and gradient alignment with theoretical directions on a subset of layers during CPT. These will help isolate dynamic effects from data quality. We believe this addresses the concern without requiring entirely new experiments. revision: yes
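A minimal sketch of the targeted measurements described above, under the assumption that "gradient alignment with theoretical directions" can be proxied by the cosine similarity between the current gradient and each layer's drift from its pretrained values; `model`, `layers_to_track`, and `loss_fn` are placeholders rather than the paper's training code.

```python
import torch

# Hedged sketch: during continual pre-training, track (i) how far selected
# layer parameters drift from their pretrained values and (ii) how well the
# current gradient aligns with that drift direction.

@torch.no_grad()
def drift_norms(model, pretrained_state, layers_to_track):
    """Relative parameter drift per tracked layer."""
    out = {}
    for name, p in model.named_parameters():
        if name in layers_to_track:
            ref = pretrained_state[name]
            out[name] = (p - ref).norm().item() / (ref.norm().item() + 1e-12)
    return out

def grad_alignment(model, pretrained_state, layers_to_track):
    """Cosine similarity between the current gradient and the drift direction."""
    out = {}
    for name, p in model.named_parameters():
        if name in layers_to_track and p.grad is not None:
            drift = (p.detach() - pretrained_state[name]).flatten()
            g = p.grad.flatten()
            out[name] = torch.nn.functional.cosine_similarity(
                g.unsqueeze(0), drift.unsqueeze(0)).item()
    return out

# Illustrative usage inside a CPT loop:
#   pretrained_state = {k: v.clone() for k, v in model.state_dict().items()}
#   loss = loss_fn(model(batch), targets); loss.backward()
#   log(drift_norms(model, pretrained_state, layers_to_track))
#   log(grad_alignment(model, pretrained_state, layers_to_track))
#   optimizer.step(); optimizer.zero_grad()
```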
Circularity Check
The theoretical analysis from the single-layer model is self-contained, with no circular reductions.
Full rationale
The paper's central derivation characterizes cFKA training dynamics via closed-form analysis of a single-layer Transformer, then contrasts regularization (rate adjustment only) vs. replay (dynamics shift). This is a modeling assumption and first-principles gradient-flow argument rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations reduce to their inputs by construction, and the single-layer choice is explicitly an approximation whose validity is tested empirically on synthetic/real data. The result does not collapse to tautology or prior author work.