FOGO: Forgetting-aware Orthogonalization Optimizer

Celso de Melo; Flora D. Salim; Toan Nguyen; Trung Le; Yang Liu

arxiv: 2606.10406 · v1 · pith:JDQFEVCSnew · submitted 2026-06-09 · 💻 cs.LG · cs.AI


Toan Nguyen, Yang Liu, Trung Le, Celso de Melo, Flora D. Salim

Pith reviewed 2026-06-27 14:08 UTC · model grok-4.3


keywords: forgetting, optimizer, orthogonalization, continual learning, gradient interference, random projection, momentum updates, codebook memory


The pith

Dominant mini-batch gradients suppress rare directions during every training step, and FOGO resolves the resulting short-term and long-term forgetting through spectral orthogonalization plus codebook correction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that forgetting is not limited to continual learning but arises in ordinary training whenever dominant gradients in a mini-batch crowd out rarer but useful update directions. These suppressed directions cause immediate short-term forgetting that accumulates into long-term loss when the affected knowledge is never revisited. FOGO counters this by spectrally orthogonalizing momentum updates so no single direction monopolizes the step, then maintains a compact codebook of representative past directions via random projection that preserves distances. Conflicts with the current update are fixed by lightweight orthogonal correction and a proximal step, all without storing raw data. Experiments across class-imbalanced classification, continual visual learning, LLaVA-7B fine-tuning, and GPT-2 pretraining show faster convergence and higher retention than Adam or Muon.

Core claim

Forgetting occurs as a general optimization phenomenon because dominant mini-batch gradients suppress rare update directions at each step; when such knowledge is never revisited these losses compound. FOGO continuously detects and resolves gradient interference by spectrally orthogonalizing momentum updates and storing representative past directions in a random-projection codebook whose pairwise distances are provably preserved, then applies lightweight orthogonal correction and a proximal step to resolve conflicts with minimal overhead.
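The suppression described above can be made concrete with a first-order estimate. The sketch below is illustrative only: the two gradient vectors, the 90/10 group weighting, the learning rate, and the helper names `cosine` and `first_order_loss_change` are assumptions chosen for the example, not values or code from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def first_order_loss_change(group_grad, update_grad, lr):
    """First-order change in a group's loss after the step theta -= lr * update_grad."""
    return float(-lr * (group_grad @ update_grad))

# Hypothetical per-group gradients that point in conflicting directions.
g_dominant = np.array([1.0, 0.0])
g_rare = np.array([-0.5, 1.0])

# Mini-batch gradient when 90% of the samples come from the dominant group.
g_batch = 0.9 * g_dominant + 0.1 * g_rare

conflict = cosine(g_dominant, g_rare)
rare_change = first_order_loss_change(g_rare, g_batch, lr=0.1)
dominant_change = first_order_loss_change(g_dominant, g_batch, lr=0.1)
print(conflict, rare_change, dominant_change)
```

With these made-up numbers the dominant group's loss falls while the rare group's loss rises to first order, which is the per-step pattern the paper calls short-term forgetting.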

What carries the argument

Spectral orthogonalization of momentum updates combined with random-projection codebook memory and proximal correction to resolve gradient conflicts.
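As a rough illustration of what spectral orthogonalization does to an update, the sketch below replaces a momentum matrix by its polar factor, so every singular direction contributes with equal weight. It uses an exact SVD for clarity; the function name `orthogonalize`, the matrix shape, and the synthetic momentum are assumptions for this example, and the paper's actual procedure may differ.

```python
import numpy as np

def orthogonalize(momentum):
    """Return the polar factor U V^T of a matrix (all singular values set to 1)."""
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)

# Hypothetical momentum: one strong rank-1 direction plus weak residual directions.
dominant = 10.0 * np.outer(np.ones(6), np.ones(4))
momentum = dominant + 0.1 * rng.standard_normal((6, 4))

before = np.linalg.svd(momentum, compute_uv=False)
after = np.linalg.svd(orthogonalize(momentum), compute_uv=False)
print("singular values before:", before)
print("singular values after: ", after)
```

Before orthogonalization the top singular value dwarfs the rest; afterwards all of them equal one, so no single direction monopolizes the step.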

If this is right

Faster convergence on class-imbalanced classification problems.
Improved retention under domain and class shifts in continual visual learning.
Higher performance during continual fine-tuning of models such as LLaVA-7B.
Better pretraining outcomes for models such as GPT-2 compared with Adam and Muon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the interference-resolution view holds, standard optimizers may need systematic rethinking for any training regime where rare directions matter.
The internal codebook approach could reduce reliance on external replay buffers in continual learning settings.
The method might extend naturally to other optimization interference problems such as multi-task training or federated settings.

Load-bearing premise

Dominant mini-batch gradients are the primary cause of both short-term and long-term forgetting, and spectral orthogonalization plus random-projection codebook correction can remove the interference without creating new instabilities or harming non-forgetting tasks.

What would settle it

Run the same set of tasks with a version of FOGO that disables the spectral orthogonalization and codebook correction steps and check whether the reported gains over Adam disappear.

Figures

Figures reproduced from arXiv: 2606.10406 by Celso de Melo, Flora D. Salim, Toan Nguyen, Trung Le, Yang Liu.

**Figure 1.** Short-term forgetting under class imbalance (CIFAR-10, ResNet-18, 5 rare classes at 10% data). (a) Dominant and rare gradients maintain negative cosine similarity throughout training. (b) The combined update initially aligns with the dominant group; after dominant loss saturates (∼step 200), the bias reverses. (c) Per-step validation loss change on a held-out probe: the suppressed group’s loss increases af…

**Figure 2.** Class-imbalanced CIFAR-10. Top: FOGO outperforms Adam and Muon under class-imbalanced training, improving rare-class accuracy and reducing rare-class forgetting while preserving non-rare accuracy. Bottom: Removing slow–fast fusion (w/o FS) and short-term filtering (w/o FS+ST) progressively increases rare-class forgetting. Curves show the mean over ten seeds, with shaded regions denoting one standard devia…

**Figure 3.** Class-incremental learning on CIFAR-10. Left: comparison with Adam, Muon, and UPGD. Right: ablations on long-term protection strength $\gamma_{\mathrm{lt}}$ and the number of frozen centroids per layer. Results are averaged over five random seeds.

**Figure 4.** Training loss (left) and validation loss (right) during GPT-2 pretraining.

**Figure 5.** Training loss curves across four continual learning settings. From left to right: class…

**Figure 6.** Gradient structure preservation under random projection (CIFAR-10, ResNet-18, d=128). (a) Pearson correlation between pairwise cosine similarities in the original and projected spaces remains above 0.9 for both groups. (b) Sign preservation rate stays around 89%, well above the 50% random baseline. (c) Scatter of individual cosine pairs at the final training step; both groups cluster tightly around the y=x…

**Figure 7.** t-SNE visualization of expert activations in continual VisualQA. Both plots use the same number of sampled points per expert, isolating representation geometry from routing-load differences. (a) FOGO maintains balanced within-expert variance and clear expert separation, suggesting a healthier distribution of task-specific representations. (b) AdamW leads to representation imbalance, where one expert collap…

**Figure 8.** Qualitative comparison between LoRA-MoE and FOGO. FOGO consistently provides…

Original abstract

We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting, the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.
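The abstract does not spell out the orthogonal correction rule, so the following is only a generic sketch of the idea, in the spirit of gradient-projection methods: whenever the current update has a negative inner product with a stored direction, the conflicting component is removed. The function name `resolve_conflicts`, the conflict test, and the toy codebook are assumptions, not FOGO's actual update.

```python
import numpy as np

def resolve_conflicts(update, codebook):
    """Remove components of `update` that oppose any stored direction.

    Assumed rule: a negative inner product with a stored direction counts
    as a conflict and is projected out, one direction at a time.
    """
    corrected = np.asarray(update, dtype=float).copy()
    for direction in codebook:
        unit = direction / np.linalg.norm(direction)
        overlap = corrected @ unit
        if overlap < 0.0:
            corrected = corrected - overlap * unit
    return corrected

# Hypothetical codebook of two stored directions and one new update.
codebook = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
update = np.array([-0.5, 0.3, 1.0])

corrected = resolve_conflicts(update, codebook)
print(corrected)
```

Here the update opposes the first stored direction, so that component is zeroed, while the non-conflicting components pass through unchanged. With stored directions that are not mutually orthogonal, a single sequential pass like this does not guarantee that every conflict is removed.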

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FOGO's spectral orthogonalization plus random-projection codebook for gradient interference is presented as new, but the abstract supplies zero derivations, ablations, or numbers, so the performance claims cannot be checked.


The paper frames forgetting as a general optimization issue where dominant mini-batch gradients suppress rarer directions during ordinary training, and those losses add up in continual settings. FOGO addresses this by spectrally orthogonalizing momentum, storing past directions in a random-projection codebook that is supposed to preserve distances, then applying a lightweight orthogonal correction and proximal lift at each step.

The combination of spectral orthogonalization with the codebook for interference correction looks like the actual new piece; the abstract does not reduce it to an earlier published method. The approach is also presented cleanly as low-overhead and data-free, which is a practical plus if it works.

The soft spots are large and central. Nothing beyond the abstract is visible, so there are no equations showing the codebook distance preservation, no ablation results isolating the correction step, and no quantitative tables to support the reported gains over Adam and Muon on class-imbalanced classification, continual visual tasks, LLaVA-7B fine-tuning, or GPT-2 pretraining. The load-bearing assumption that dominant gradients are the main cause of both short- and long-term forgetting, and that the correction fixes it without new instabilities or bias on standard tasks, is stated but not tested in anything provided. The stress-test note correctly flags this as the point that needs evidence; without it the outperformance could be spurious.

This is aimed at people working on optimizers for large models or continual learning who care about interference without extra memory. A reader looking for a concrete method to try would need the full experiments and any formal arguments for the codebook before investing time.

I would not send it to peer review yet. The idea is coherent on its own terms, but the complete absence of supporting material makes it too thin for referee effort.

Referee Report

3 major / 1 minor

Summary. The paper claims that forgetting arises as a general optimization issue when dominant mini-batch gradients suppress rare but useful directions, leading to both short-term and long-term forgetting. It introduces FOGO, which spectrally orthogonalizes momentum updates and applies lightweight orthogonal correction against a compact codebook constructed via random projection (with provably preserved pairwise distances), using a proximal step to resolve conflicts. The method is evaluated on class-imbalanced classification, continual visual learning with domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, where it reportedly improves convergence and retention over Adam and Muon with minimal overhead and no data storage.

Significance. If the empirical gains and the underlying mechanism hold under scrutiny, the work could be significant by reframing forgetting as an in-training optimization phenomenon rather than a problem exclusive to continual learning, offering a scalable optimizer applicable to large models with potential for improved stability in imbalanced and sequential training regimes.

major comments (3)

[Method description] The central claim that spectral orthogonalization plus random-projection codebook correction resolves gradient interference without introducing new instabilities or altering effective step sizes is load-bearing for all reported gains, yet the manuscript provides no ablation isolating the codebook's contribution or testing performance degradation when the correction is removed on non-forgetting tasks.
[Introduction / opening argument] The assumption that dominant mini-batch gradients are the primary cause of forgetting (rather than other factors such as learning-rate schedules or data statistics) underpins the method's design, but no direct quantitative test or comparison to alternative explanations is presented to support this as the dominant mechanism.
[Method] The claim of provable distance preservation in the low-dimensional codebook via random projection is stated but lacks an explicit statement of the projection dimension, the Johnson-Lindenstrauss parameters used, or a derivation showing that the proximal lift-back step preserves the claimed orthogonality properties.
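The distance-preservation property questioned in the last comment can be checked empirically for a plain Gaussian random projection, the standard construction behind the Johnson-Lindenstrauss lemma. The sketch below is an assumption-laden illustration: the input dimension (4096), the number of vectors (20), and the synthetic "directions" are invented for the example; only the target dimension d=128 is taken from the Figure 6 caption, and the paper's projection may be constructed differently.

```python
import numpy as np

def random_projection(dim_in, dim_out, rng):
    """Gaussian projection matrix scaled so squared norms are preserved in expectation."""
    return rng.standard_normal((dim_out, dim_in)) / np.sqrt(dim_out)

def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
directions = rng.standard_normal((20, 4096))   # stand-ins for past update directions
proj = random_projection(4096, 128, rng)
compressed = directions @ proj.T               # shape (20, 128)

upper = np.triu_indices(20, k=1)               # each unordered pair once
ratios = pairwise_distances(compressed)[upper] / pairwise_distances(directions)[upper]
print("distance ratio range:", ratios.min(), ratios.max())
```

Ratios close to one mean the 128-dimensional codes keep roughly the same pairwise geometry as the original vectors. The spread of these ratios shrinks as the target dimension grows, which is why the referee asks for the projection dimension and tolerance to be stated.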

minor comments (1)

[Abstract] The abstract states 'minimal overhead' but does not quantify the additional compute or memory cost of the codebook maintenance and proximal step relative to Adam or Muon.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript where appropriate.


Referee: The central claim that spectral orthogonalization plus random-projection codebook correction resolves gradient interference without introducing new instabilities or altering effective step sizes is load-bearing for all reported gains, yet the manuscript provides no ablation isolating the codebook's contribution or testing performance degradation when the correction is removed on non-forgetting tasks.

Authors: We agree that an explicit ablation isolating the codebook component would strengthen the claims. In the revised manuscript we will add experiments that disable the codebook correction while retaining spectral orthogonalization, evaluating on standard balanced classification tasks (e.g., CIFAR-10/100 and ImageNet subsets) to verify that performance does not degrade relative to Adam/Muon baselines when forgetting is not a concern. revision: yes

Referee: The assumption that dominant mini-batch gradients are the primary cause of forgetting (rather than other factors such as learning-rate schedules or data statistics) underpins the method's design, but no direct quantitative test or comparison to alternative explanations is presented to support this as the dominant mechanism.

Authors: The paper motivates the mechanism through the observed interference pattern and demonstrates consistent gains across imbalanced, continual, and large-model regimes. To address the request for direct comparison, we will add a targeted analysis section that varies learning-rate schedules and data statistics independently while measuring directional suppression, providing quantitative support for the relative contribution of mini-batch gradient dominance. revision: yes

Referee: The claim of provable distance preservation in the low-dimensional codebook via random projection is stated but lacks an explicit statement of the projection dimension, the Johnson-Lindenstrauss parameters used, or a derivation showing that the proximal lift-back step preserves the claimed orthogonality properties.

Authors: We will revise the method section to state the exact projection dimension employed, the specific Johnson-Lindenstrauss lemma parameters (including target dimension and failure probability), and include a short derivation (or reference to the relevant lemma) confirming that the proximal lift-back step preserves the orthogonality guarantees after the low-dimensional correction. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain self-contained with no reductions to inputs


The abstract and provided description introduce FOGO via a conceptual argument about gradient interference and describe its components (spectral orthogonalization of momentum, random-projection codebook, proximal correction) without any equations, fitted parameters, or self-citations that reduce the claimed improvements to quantities defined in terms of themselves. No load-bearing step matches the enumerated circularity patterns; the performance claims on downstream tasks are presented as empirical outcomes rather than algebraic identities or renamed fits. The central premise is an assumption about forgetting mechanisms, not a derivation that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review performed on abstract only; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level description of the codebook memory and orthogonal correction. The central claim rests on the unverified premise that gradient interference is the dominant forgetting mechanism and that the proposed corrections resolve it scalably.

invented entities (1)

compact codebook memory via random projection (no independent evidence)
purpose: store representative past directions while provably preserving pairwise distances in low-dimensional space
Introduced in the abstract as the storage mechanism for past directions; no independent evidence or falsifiable prediction supplied in the provided text.

pith-pipeline@v0.9.1-grok · 5722 in / 1354 out tokens · 16498 ms · 2026-06-27T14:08:51.367306+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 8 canonical work pages · 7 internal anchors

[1]

A definition of continual reinforcement learning

David Abel, Andre Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, and Satinder Singh. A definition of continual reinforcement learning. In NeurIPS, 2023

2023
[2]

Nested learning: The illusion of deep learning architectures

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures. In NeurIPS, 2025

2025
[3]

Old Optimizer, New Norm: An Anthology

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024

2024
[4]

Dark experience for general continual learning: a strong, simple baseline

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. NeurIPS, 33:15920–15930, 2020

2020
[5]

Efficient lifelong learning with a-GEM

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. In ICLR, 2019

2019
[6]

On Tiny Episodic Memories in Continual Learning

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019

2019
[7]

A continual learning survey: Defying forgetting in classification tasks

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

2021
[8]

Loss of plasticity in deep continual learning

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024

2024
[9]

Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment

Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment, 2023

2023
[10]

Addressing loss of plasticity and catastrophic forgetting in continual learning

Mohamed Elsayed and A. Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning. In ICLR, 2024

2024
[11]

Orthogonal gradient descent for continual learning

Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In AISTATS, pages 3762–3773. PMLR, 2020

2020
[12]

The vendi score: A diversity evaluation metric for machine learning

Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. TMLR, 2023. ISSN 2835-8856

2023
[13]

Openwebtext corpus

Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

2019
[14]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017

2017
[15]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In ICML, pages 1842–1850. PMLR, 2018.

2018
[16]

Fire: Frobenius-isometry reinitialization for balancing the stability-plasticity tradeoff

Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, and Kyung-Joong Kim. Fire: Frobenius-isometry reinitialization for balancing the stability-plasticity tradeoff. ICLR, 2026

2026
[17]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016

2016
[18]

Functions of matrices: theory and computation

Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008

2008
[19]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022

2022
[20]

MiniCPM: Unveiling the potential of small language models with scalable training strategies

Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language models with...

2024
[21]

Cl-moe: Enhancing multimodal large language model with dual momentum mixture-of-experts for continual visual question answering

Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, and Liang He. Cl-moe: Enhancing multimodal large language model with dual momentum mixture-of-experts for continual visual question answering. In CVPR, pages 19608–19617, 2025

2025
[22]

Extensions of lipschitz mappings into a hilbert space

William B Johnson, Joram Lindenstrauss, et al. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984

1984
[23]

Muon: An optimizer for hidden layers in neural networks

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

2024
[24]

Prismatic synthesis: Gradient-based data diversification boosts generalization in LLM reasoning

Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Prismatic synthesis: Gradient-based data diversification boosts generalization in LLM reasoning. In NeurIPS, 2025

2025
[25]

Nanogpt

Andrej Karpathy. Nanogpt. https://github.com/karpathy/nanoGPT, 2022

2022
[26]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

2014
[27]

Overcoming catastrophic forgetting in neural networks

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017
[28]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

2009
[29]

Deeper, broader and artier domain generalization

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In ICCV, pages 5542–5550, 2017

2017
[30]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023

2023
[31]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

2025
[32]

Gradient episodic memory for continual learning

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. NIPS, 30, 2017

2017
[33]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

2019
[34]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

2019
[35]

Learning to learn without forgetting by maximizing transfer and minimizing interference

Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, 2019

2019
[36]

Progressive Neural Networks

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

2016
[37]

Gradient projection memory for continual learning

Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In ICLR, 2021

2021
[38]

On the Convergence Analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

2025
[39]

Adamuon: Adaptive muon optimizer

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005, 2025

2025
[40]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

2024
[41]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

2023
[42]

A comprehensive survey of continual learning: Theory, method and application

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE transactions on pattern analysis and machine intelligence, 46(8):5362–5383, 2024

2024
[43]

Orthogonal subspace learning for language model continual learning

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of EMNLP, pages 10658–10671, 2023.

2023

A Extended Background. Orthogonalization. Let $B_t \in \mathbb{R}^{m \times n}$ with compact SVD $B_t = U_t \Sigma_t V_t^\top$, where $\Sigma_t = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$ and $\sigma_1 \ge \cdots$ ...
[44]

Greens”), FOGO correctly identifies the specific ingredient (“Spinach

formalize this by showing that Muon and related optimizers perform steepest descent under a spectral-norm constraint. Newton–Schulz iterations. The polar factor $O(B_t)$ is approximated without a full SVD via the Newton–Schulz iteration. Starting from $X_0 = B_t / \|B_t\|_F$, the recurrence $X_{k+1} = \tfrac{1}{2} X_k (3I - X_k^\top X_k)$ converges quadratically to $U_t V_t^\top$ when $\|B_t\|_2 < 1$. ...
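The Newton–Schulz recurrence quoted in this fragment can be tried directly. The sketch below is a minimal NumPy version of that cubic iteration, compared against the polar factor from an exact SVD; the function name `newton_schulz_polar`, the step count, and the random test matrix are assumptions for illustration, and practical optimizers may use a different polynomial or far fewer steps.

```python
import numpy as np

def newton_schulz_polar(b, steps=40):
    """Approximate the polar factor U V^T of b (rows >= columns) without an SVD."""
    x = b / np.linalg.norm(b)             # X_0 = B / ||B||_F, so all singular values are <= 1
    eye = np.eye(b.shape[1])
    for _ in range(steps):
        x = 0.5 * x @ (3.0 * eye - x.T @ x)   # X_{k+1} = 1/2 X_k (3I - X_k^T X_k)
    return x

rng = np.random.default_rng(0)
b = rng.standard_normal((5, 3))           # hypothetical update matrix

approx = newton_schulz_polar(b)
u, _, vt = np.linalg.svd(b, full_matrices=False)
exact = u @ vt
max_error = float(np.abs(approx - exact).max())
print("max abs difference from SVD polar factor:", max_error)
```

Each step pushes every singular value of the iterate toward one while leaving the singular vectors untouched. Small singular values grow by only about a factor of 1.5 per step at first, which is why this sketch uses a generous step count.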

[1] [1]

A definition of continual reinforcement learning

David Abel, Andre Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, and Satinder Singh. A definition of continual reinforcement learning. InNeurIPS, 2023

2023

[2] [2]

Nested learning: The illusion of deep learning architectures

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures. InNeurIPS, 2025

2025

[3] [3]

Old Optimizer, New Norm: An Anthology

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Dark experience for general continual learning: a strong, simple baseline.NeurIPS, 33:15920–15930, 2020

Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline.NeurIPS, 33:15920–15930, 2020

2020

[5] [5]

Efficient lifelong learning with a-GEM

Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-GEM. InICLR, 2019

2019

[6] [6]

On Tiny Episodic Memories in Continual Learning

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. On tiny episodic memories in continual learning.arXiv preprint arXiv:1902.10486, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[7] [7]

A continual learning survey: Defying forgetting in classification tasks

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

2021

[8] [8]

Loss of plasticity in deep continual learning.Nature, 632(8026):768–774, 2024

Shibhansh Dohare, J Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A Rupam Mahmood, and Richard S Sutton. Loss of plasticity in deep continual learning.Nature, 632(8026):768–774, 2024

2024

[9] [9]

Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment, 2023

Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment, 2023

2023

[10] Mohamed Elsayed and A. Rupam Mahmood. Addressing loss of plasticity and catastrophic forgetting in continual learning. In ICLR, 2024.

[11] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In AISTATS, pages 3762–3773. PMLR, 2020.

[12] Dan Friedman and Adji Bousso Dieng. The Vendi score: A diversity evaluation metric for machine learning. TMLR, 2023. ISSN 2835-8856.

[13] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

[14] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017.

[15] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In ICML, pages 1842–1850. PMLR, 2018.

[16] Isaac Han, Sangyeon Park, Seungwon Oh, Donghu Kim, Hojoon Lee, and Kyung-Joong Kim. Fire: Frobenius-isometry reinitialization for balancing the stability-plasticity tradeoff. ICLR, 2026.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[18] Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008.

[19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

[20] Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language models with scalable training strategies. ... 2024.

[21] Tianyu Huai, Jie Zhou, Xingjiao Wu, Qin Chen, Qingchun Bai, Ze Zhou, and Liang He. CL-MoE: Enhancing multimodal large language model with dual momentum mixture-of-experts for continual visual question answering. In CVPR, pages 19608–19617, 2025.

[22] William B Johnson, Joram Lindenstrauss, et al. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

[23] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

[24] Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. Prismatic synthesis: Gradient-based data diversification boosts generalization in LLM reasoning. In NeurIPS, 2025.

[25] Andrej Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[27] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[28] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[29] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In ICCV, pages 5542–5550, 2017.

[30] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023.

[31] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.

[32] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. NIPS, 30, 2017.

[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

[34] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[35] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In ICLR, 2019.

[36] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[37] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In ICLR, 2021.

[38] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon. arXiv preprint arXiv:2505.23737, 2025.

[39] Chongjie Si, Debing Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025.

[40] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[41] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[42] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, 2024.

[43] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of EMNLP, pages 10658–10671, 2023.

A Extended Background

Orthogonalization. Let $B_t \in \mathbb{R}^{m \times n}$ with compact SVD $B_t = U_t \Sigma_t V_t^\top$, where $\Sigma_t = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$ and $\sigma_1 \ge \cdots$ ...


formalize this by showing that Muon and related optimizers perform steepest descent under a spectral-norm constraint.

Newton–Schulz iterations. The polar factor $O(B_t)$ is approximated without a full SVD via the Newton–Schulz iteration. Starting from $X_0 = B_t / \|B_t\|_F$, the recurrence $X_{k+1} = \tfrac{1}{2} X_k (3I - X_k^\top X_k)$ converges quadratically to $U_t V_t^\top$ when $\|B_t\|_2 < 1$. ...
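The Newton–Schulz recurrence can be illustrated with a minimal NumPy sketch. It assumes a small, dense, full-column-rank matrix. The function name `newton_schulz_polar`, the iteration count, and the random test matrix are illustrative choices, not the paper's implementation. The sketch normalizes by the Frobenius norm, which keeps the spectral norm at or below 1, and compares the final iterate with the polar factor computed from an explicit SVD.

```python
import numpy as np

def newton_schulz_polar(B, num_iters=50):
    """Approximate the polar factor U V^T of B with the cubic Newton-Schulz recurrence.

    Hypothetical helper for illustration; num_iters=50 is a conservative
    choice for a small example, not a tuned value.
    """
    # X_0 = B / ||B||_F places every singular value in (0, 1].
    X = B / np.linalg.norm(B)
    I = np.eye(B.shape[1])
    for _ in range(num_iters):
        # X_{k+1} = 1/2 * X_k (3I - X_k^T X_k)
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

# Compare against the polar factor obtained from a compact SVD.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))
U, _, Vt = np.linalg.svd(B, full_matrices=False)
X = newton_schulz_polar(B)
err = np.linalg.norm(X - U @ Vt)
print(f"Frobenius error vs. SVD polar factor: {err:.2e}")
```

Each iteration maps a singular value $s$ to $\tfrac{1}{2} s (3 - s^2)$, so small singular values grow by roughly a factor of 1.5 per step before the quadratic regime takes over. This is why a poorly conditioned matrix needs more iterations than a well-conditioned one.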
