pith. machine review for the scientific record.

arxiv: 2605.02105 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.CL

Recognition: 3 theorem links · Lean Theorem

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

Aditi Raghunathan, Catherine Li, Ishaan Watts, Jacob Mitchell Springer, Sachin Goyal

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:28 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords catastrophic forgetting · sharpness-aware minimization · pretraining · post-training · quantization · loss landscape geometry · model retention

The pith

Sharpness-aware pretraining produces base models that retain more capabilities after post-training and quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that pretraining should target not only a strong base model but one whose loss landscape geometry allows capabilities to survive later parameter changes. Interventions that push optimization toward flatter minima, including Sharpness-Aware Minimization, large learning rates, and shortened annealing, achieve this and deliver up to 80 percent less forgetting across five datasets for models from 20M to 150M parameters. The same principle works at scale: a brief SAM phase on an existing 1B-parameter checkpoint cuts forgetting by 31 percent after mathematical post-training and by 40 percent after 4-bit quantization. Readers should care because nearly every deployed model follows a multi-stage pipeline, so improving retention at the pretraining stage offers a lightweight way to make later stages less destructive.
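
To make the quantization half of that pipeline concrete, the sketch below applies generic round-to-nearest 4-bit weight quantization, the simplest scheme of the kind the paper evaluates against; the per-row absmax scaling and the function name are illustrative assumptions, not the authors' recipe. The dequantized weights differ from the originals by exactly the kind of weight perturbation a sharp minimum amplifies.

    import torch

    def rtn_quantize_4bit(weight: torch.Tensor) -> torch.Tensor:
        # Per-row absmax scale onto the signed int4 grid [-8, 7].
        scale = weight.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 7.0
        q = torch.clamp(torch.round(weight / scale), -8, 7)
        # Dequantized weights: the original weights plus a rounding perturbation.
        return q * scale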

Core claim

Pretraining optimization approaches that bias toward flatter minima—Sharpness-Aware Minimization (SAM), large learning rates, and shortened learning rate annealing periods—produce base models whose capabilities better survive subsequent parameter updates, leading to improved downstream performance with substantially less forgetting. Across model sizes ranging from 20M to 150M parameters, these interventions consistently improve downstream performance after post-training on five common datasets, with up to 80% less forgetting. These principles hold at scale: a short SAM mid-training phase applied to an existing OLMo-2-1B checkpoint reduces forgetting by 31% after MetaMath post-training and by 40% after 4-bit quantization.
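
Two of the three interventions are pure schedule choices, so a small sketch helps fix terms. The code below contrasts a cosine schedule with a Warmup-Stable-Decay (WSD) schedule whose anneal fraction is exposed; shrinking anneal_frac is the "shortened annealing" intervention. The linear decay shape and warmup handling are assumptions for illustration, not the paper's exact recipe.

    import math

    def lr_cosine(step, total, peak, warmup):
        # Linear warmup, then cosine decay toward zero.
        if step < warmup:
            return peak * step / warmup
        t = (step - warmup) / (total - warmup)
        return 0.5 * peak * (1.0 + math.cos(math.pi * t))

    def lr_wsd(step, total, peak, warmup, anneal_frac=0.10):
        # Warmup-Stable-Decay: hold the peak, then decay only in the
        # final anneal_frac of training (assumed linear here).
        decay_start = int(total * (1.0 - anneal_frac))
        if step < warmup:
            return peak * step / warmup
        if step < decay_start:
            return peak
        return peak * (total - step) / (total - decay_start)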

What carries the argument

The flatness of the loss minimum reached during pretraining, achieved via SAM or related interventions, governs how much of the base model's performance is preserved under later updates.
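
Figure 13 below summarizes the SAM update: an ascent step along the gradient, a gradient evaluated at the perturbed point, and a parameter update using that perturbed gradient. Here is a minimal PyTorch sketch of that loop, assuming a user-supplied loss_fn(model, batch) and a wrapped base optimizer; it illustrates the standard SAM step (Foret et al.), not the authors' implementation. ρ = 0.05 follows the value the paper reports for the OLMo-2-1B mid-training run.

    import torch

    def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
        # Pass 1: gradient at the current weights.
        loss_fn(model, batch).backward()
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        # Ascent step: move to the approximate worst point in an
        # L2 ball of radius rho around the current weights.
        eps = []
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    continue
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)
                eps.append(e)
        model.zero_grad()
        # Pass 2: gradient at the perturbed point.
        loss_fn(model, batch).backward()
        # Undo the perturbation, then step with the perturbed gradient.
        with torch.no_grad():
            perturbed = [p for p in model.parameters() if p.grad is not None]
            for p, e in zip(perturbed, eps):
                p.sub_(e)
        base_opt.step()
        base_opt.zero_grad()

Each SAM step costs two forward-backward passes, which is one reason to confine SAM to a brief mid-training phase at the 1B scale rather than run it throughout pretraining.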

If this is right

  • Base models trained with these methods show consistent downstream gains after post-training on multiple datasets.
  • The retention benefit appears across scales from 20M to 150M parameters and extends to 1B-parameter models via a short mid-training phase.
  • The same interventions reduce forgetting during both fine-tuning and quantization.
  • A short SAM phase can be inserted into existing pretraining runs without restarting from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standard pretraining may routinely converge to sharp minima that are unnecessarily fragile for the multi-stage pipelines now common in large models.
  • The approach could be combined with other geometry-aware techniques such as weight averaging or regularization that also favor flat regions.
  • If the effect scales further, it would imply that current scaling laws for pretraining compute may need adjustment to account for downstream retention rather than pretraining loss alone.
  • The result suggests testing whether flatter base models also exhibit better zero-shot generalization before any post-training occurs.

Load-bearing premise

That flatter minima directly cause the observed retention gains rather than arising alongside other unmeasured factors in the training trajectory.
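
The figures probe this premise directly by measuring the rise in pretraining loss under Gaussian weight perturbations of magnitude γ. A rough sketch of such a probe, assuming per-tensor noise scaled to each tensor's RMS (the paper's exact normalization of γ is not specified here):

    import copy

    import torch

    @torch.no_grad()
    def gaussian_perturbation_probe(model, eval_loss, gamma=0.025, n_samples=4):
        # eval_loss(model) -> scalar pretraining-eval loss (user-supplied helper).
        base = eval_loss(model)
        rise = 0.0
        for _ in range(n_samples):
            probe = copy.deepcopy(model)
            for p in probe.parameters():
                rms = p.pow(2).mean().sqrt()  # scale noise to the weight RMS
                p.add_(gamma * rms * torch.randn_like(p))
            rise += eval_loss(probe) - base
        # Flatter minima should show a smaller average rise at a given gamma.
        return base, rise / n_samples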

What would settle it

Train two base models to identical pretraining validation loss, one using SAM and one using standard optimization, then apply identical post-training and measure forgetting rates; equal rates would falsify the claim that flatness controls retention.
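
As pseudocode, with pretrain, post_train, and eval_loss passed in as stand-ins for a real training stack (hypothetical helpers; only the comparison logic is the point):

    def matched_loss_test(pretrain, post_train, eval_loss, target_val_loss):
        # pretrain(optimizer=..., stop_at_val_loss=...) -> base model;
        # post_train(model) -> post-trained model;
        # eval_loss(model) -> loss on the pretraining eval set.
        base_sam = pretrain(optimizer="sam", stop_at_val_loss=target_val_loss)
        base_adamw = pretrain(optimizer="adamw", stop_at_val_loss=target_val_loss)
        # Identical post-training recipe applied to both matched-loss bases;
        # forgetting = rise in pretraining-eval loss caused by post-training.
        f_sam = eval_loss(post_train(base_sam)) - eval_loss(base_sam)
        f_adamw = eval_loss(post_train(base_adamw)) - eval_loss(base_adamw)
        # Equal forgetting rates would falsify "flatness controls retention".
        return f_sam, f_adamw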

Figures

Figures reproduced from arXiv: 2605.02105 by Aditi Raghunathan, Catherine Li, Ishaan Watts, Jacob Mitchell Springer, Sachin Goyal.

Figure 1
Figure 1. Main results from OLMo-2-1B experiments. We take an OLMo-2-1B model pretrained on 4T tokens and then mid-train it for 50B tokens using SAM and AdamW. After further modification by SFT (MetaMath, StackMathQA, Tülu-3, and MusicPile) and 4-bit quantization, SAM reduces forgetting on the pretraining eval benchmark. view at source ↗
Figure 2
Figure 2. SAM consistently yields pretrained checkpoints that forget less when fine-tuned to the same performance as AdamW counterparts. We pretrain OLMo-60M models with a cosine schedule using AdamW and SAM on 192B tokens and fine-tune on five datasets. SAM achieves a better learning-forgetting frontier. view at source ↗
Figure 3
Figure 3. Comparison of SAM and AdamW with model size. (a) Across model sizes and token budgets, SAM-pretrained models achieve a worse or similar pretraining loss compared to AdamW. However, better pretraining loss alone does not translate into mitigating forgetting. (b) We pretrain OLMo models of sizes 20M, 60M, and 150M at similar token-per-parameter ratios (800) and then fine-tune on StarCoder. view at source ↗
Figure 4
Figure 4. SAM's improvement over AdamW grows with scaling pretraining tokens. We pretrain OLMo-60M models with a cosine schedule using AdamW and SAM on 4B to 192B tokens and fine-tune on StarCoder. The gap between SAM and AdamW widens as we scale pretraining tokens. Gains are largest for StarCoder and MusicPile and smallest for Tülu-3, likely because Tülu-3 is closer to DCLM. view at source ↗
Figure 5
Figure 5. SAM delays the onset of catastrophic overtraining. We pretrain OLMo-60M models with a cosine learning rate schedule using AdamW and SAM and then fine-tune on five datasets. We plot the minimum achievable pretraining loss such that the fine-tuning loss is below a threshold (more details in Appendix C.4) as a function of the base model loss. view at source ↗
Figure 6
Figure 6. Higher peak learning rates improve the learning-forgetting tradeoff. We vary the peak pretraining learning rate for 60M models with a cosine schedule on 192B tokens. (a) Pretraining loss vs. peak learning rate. (b) Learning-forgetting Pareto frontier on StarCoder. The asterisk in the legend marks the peak learning rate that achieves the lowest base-model pretraining loss. (c) 4-bit quantized pretraining loss. view at source ↗
Figure 7
Figure 7. Shorter annealing periods improve the learning-forgetting tradeoff. We vary the annealing duration as a percentage of total training steps for 60M models with a WSD schedule. (a) Pretraining loss vs. anneal percent. (b) Learning-forgetting Pareto frontier on StarCoder. (c) 4-bit quantized pretraining loss vs. anneal percent. (d) Perturbed pretraining loss vs. perturbation magnitude γ. view at source ↗
Figure 8
Figure 8. SAM improves sensitivity to post-training quantization and Gaussian perturbations. We pretrain OLMo-60M for budgets ranging from 12B to 192B tokens. (a) 4-bit quantized pretraining loss vs. pretraining tokens, with the unquantized AdamW reference for scale. (b) Perturbed pretraining loss at γ = 0.025 vs. pretraining tokens. view at source ↗
Figure 9
Figure 9. Annealing with SAM improves downstream performance over baseline annealing. We pretrain OLMo-60M for 192B tokens with a WSD schedule and a 10% anneal, comparing AdamW throughout (baseline annealing) to a recipe that switches to SAM during the decay phase. (a) Learning-forgetting Pareto frontier after fine-tuning on StarCoder (10M tokens). (b) 4-bit quantized pretraining loss vs. pretraining tokens (12B–192B). view at source ↗
Figure 10
Figure 10. Quadratic approximation vs. observed loss for the token sweep. We compare 60M AdamW and SAM checkpoints fine-tuned on StarCoder after pretraining on 12B, 24B, 48B, 96B, and 192B tokens. Columns correspond to token budget, the top row shows AdamW, and the bottom row shows SAM. Solid lines show the observed pretraining loss after fine-tuning as we sweep the fine-tuning learning rate, and dashed lines show the quadratic approximation. view at source ↗
Figure 11
Figure 11. Quadratic approximation vs. observed loss across fine-tuning learning rates. We fix 60M AdamW checkpoints at 192B pretraining tokens and fine-tune on StarCoder. Each panel corresponds to a different peak pretraining learning rate. Solid lines show the observed pretraining loss after fine-tuning as we sweep the fine-tuning learning rate, and dashed lines show the quadratic approximation. view at source ↗
Figure 12
Figure 12. SAM and large peak learning rates both lower fine-tuning directional sharpness. For 60M StarCoder checkpoints fine-tuned with learning rate 4 × 10⁻⁴: (a) normalized directional sharpness vs. pretraining tokens for AdamW and SAM at their canonical peak learning rates, and (b) normalized directional sharpness vs. pretraining peak learning rate for AdamW at 192B tokens. view at source ↗
Figure 13
Figure 13. SAM update schematic. SAM first takes an ascent step along the gradient, evaluates the gradient at this perturbed point, and then updates the parameters using this perturbed gradient. view at source ↗
Figure 14
Figure 14. Learning rate scheduling schematic for cosine and Warmup-Stable-Decay (WSD) schedules. Appendix B details the OLMo-2-1B experiments: a checkpoint pretrained on 4T tokens (OLMo et al., 2024) is mid-trained for 50B tokens on the Dolmino mixture using AdamW and SAM with ρ = 0.05 (defined in Section 2.3.1), selected by tuning. view at source ↗
Figure 15
Figure 15. Pretraining learning rate tuning for controlled experiments. Appendix C.2 describes fine-tuning on five publicly available datasets: StarCoder (Li et al., 2023; code generation), GSM8K (Cobbe et al., 2021) and StackMathQA (Zhang, 2024) (mathematical reasoning), Tülu-3 (Lambert et al., 2025; instruction following), and MusicPile (Yuan et al., 2024; domain-specific). view at source ↗
Figure 16
Figure 16. Learning-forgetting frontier for OLMo-2-1B across MetaMath, StackMathQA, Tülu-3, and MusicPile. view at source ↗
Figure 17
Figure 17. AdamW vs. SAM learning-forgetting frontier for OLMo-20M across datasets at 64B tokens. view at source ↗
Figure 18
Figure 18. AdamW vs. SAM learning-forgetting frontier for OLMo-150M across datasets at 240B tokens. view at source ↗
Figure 19
Figure 19. AdamW vs. SAM with scaling pretraining tokens on StarCoder for OLMo-20M. view at source ↗
Figure 20
Figure 20. AdamW vs. SAM with scaling pretraining tokens on MusicPile for OLMo-20M. view at source ↗
Figure 21
Figure 21. AdamW vs. SAM with scaling pretraining tokens on Tülu-3 for OLMo-20M. view at source ↗
Figure 22
Figure 22. AdamW vs. SAM with scaling pretraining tokens on StackMathQA for OLMo-20M. view at source ↗
Figure 23
Figure 23. AdamW vs. SAM with scaling pretraining tokens on GSM8K for OLMo-20M. view at source ↗
Figure 24
Figure 24. AdamW vs. SAM with scaling pretraining tokens on StarCoder for OLMo-60M. view at source ↗
Figure 25
Figure 25. AdamW vs. SAM with scaling pretraining tokens on MusicPile for OLMo-60M. view at source ↗
Figure 26
Figure 26. AdamW vs. SAM with scaling pretraining tokens on Tülu-3 for OLMo-60M. view at source ↗
Figure 27
Figure 27. AdamW vs. SAM with scaling pretraining tokens on StackMathQA for OLMo-60M. view at source ↗
Figure 28
Figure 28. AdamW vs. SAM with scaling pretraining tokens on GSM8K for OLMo-60M. view at source ↗
Figure 29
Figure 29. AdamW vs. SAM with scaling pretraining tokens on StarCoder for OLMo-150M. view at source ↗
Figure 30
Figure 30. AdamW vs. SAM with scaling pretraining tokens on MusicPile for OLMo-150M. view at source ↗
Figure 31
Figure 31. AdamW vs. SAM with scaling pretraining tokens on Tülu-3 for OLMo-150M. view at source ↗
Figure 32
Figure 32. AdamW vs. SAM with scaling pretraining tokens on StackMathQA for OLMo-150M. view at source ↗
Figure 33
Figure 33. AdamW vs. SAM with scaling pretraining tokens on GSM8K for OLMo-150M. view at source ↗
Figure 34
Figure 34. AdamW vs. SAM learning-forgetting frontier across model sizes at 800 tokens per parameter for StarCoder. view at source ↗
Figure 35
Figure 35. AdamW vs. SAM learning-forgetting frontier across model sizes at 800 tokens per parameter for MusicPile. view at source ↗
Figure 36
Figure 36. AdamW vs. SAM learning-forgetting frontier across model sizes at 800 tokens per parameter for Tülu-3. view at source ↗
Figure 37
Figure 37. AdamW vs. SAM learning-forgetting frontier across model sizes at 800 tokens per parameter for StackMathQA. view at source ↗
Figure 38
Figure 38. AdamW vs. SAM learning-forgetting frontier across model sizes at 800 tokens per parameter for GSM8K. view at source ↗
Figure 39
Figure 39. Learning-forgetting tradeoff for AdamW vs. SAM at the pretraining loss-matched setting for OLMo-20M across datasets. view at source ↗
Figure 40
Figure 40. Learning-forgetting tradeoff for AdamW vs. SAM at the pretraining loss-matched setting for OLMo-150M across datasets. Appendix E.1.5 studies SAM in combination with Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) by fine-tuning OLMo-60M checkpoints pretrained on 192B tokens with SAM. view at source ↗
Figure 41
Figure 41. Tuning λ for EWC for OLMo-60M pretrained on 192B tokens and fine-tuned on StarCoder. We pretrain OLMo-60M on 192B tokens using SAM and AdamW, and then fine-tune with Elastic Weight Consolidation (EWC) on StarCoder to tune λ separately for each optimizer. We find λ = 1e+04 yields the best learning-forgetting Pareto frontier. view at source ↗
Figure 42
Figure 42. SAM + EWC outperforms AdamW + EWC. Learning-forgetting frontier for OLMo-60M pretrained on 192B tokens and fine-tuned on StarCoder and MusicPile with EWC. view at source ↗
Figure 43
Figure 43. AdamW vs. SAM under 4-bit and 8-bit post-training quantization for OLMo-20M. view at source ↗
Figure 44
Figure 44. AdamW vs. SAM under 4-bit and 8-bit post-training quantization for OLMo-60M. view at source ↗
Figure 45
Figure 45. AdamW vs. SAM under 4-bit and 8-bit post-training quantization for OLMo-150M. view at source ↗
Figure 46
Figure 46. AdamW vs. SAM Gaussian perturbation sensitivity for OLMo-20M. view at source ↗
Figure 47
Figure 47. AdamW vs. SAM Gaussian perturbation sensitivity for OLMo-60M. view at source ↗
Figure 48
Figure 48. AdamW vs. SAM Gaussian perturbation sensitivity for OLMo-150M. view at source ↗
Figure 49
Figure 49. Base-model pretraining loss across peak LR using WSD for OLMo-60M pretrained on 192B tokens. view at source ↗
Figure 50
Figure 50. Peak learning rate learning-forgetting frontier across datasets for the cosine schedule. view at source ↗
Figure 51
Figure 51. Peak learning rate learning-forgetting frontier across datasets for the WSD schedule (10% annealing steps). view at source ↗
Figure 52
Figure 52. Perturbed pretraining loss vs. perturbation magnitude γ for a sweep of peak learning rates: (a) WSD (10% annealing steps) and (b) cosine schedule, for OLMo-60M pretrained on 192B tokens. view at source ↗
Figure 53
Figure 53. Effect of peak learning rate with the WSD schedule (10% annealing steps) for OLMo-60M pretrained on 192B tokens under 4-bit and 8-bit post-training quantization. view at source ↗
Figure 54
Figure 54. Effect of peak learning rate with the cosine schedule (10% annealing steps) for OLMo-60M pretrained on 192B tokens under 4-bit and 8-bit post-training quantization. view at source ↗
Figure 55
Figure 55. Learning-forgetting frontier across datasets for OLMo-60M pretrained on 192B tokens using a WSD schedule with varying periods of annealing. view at source ↗
Figure 56
Figure 56. Annealing with SAM vs. baseline annealing (WSD): learning-forgetting frontier across datasets for OLMo-20M pretrained on 64B tokens. view at source ↗
Figure 57
Figure 57. Annealing with SAM vs. baseline annealing (WSD): learning-forgetting frontier across datasets for OLMo-60M pretrained on 192B tokens. view at source ↗
Figure 58
Figure 58. Annealing with SAM vs. baseline annealing (WSD): learning-forgetting frontier across datasets for OLMo-150M pretrained on 120B tokens. view at source ↗
Figure 59
Figure 59. Annealing with SAM vs. baseline annealing (WSD): pretraining loss vs. pretraining tokens at different perturbation magnitudes γ for OLMo-20M. view at source ↗
Figure 60
Figure 60. Annealing with SAM vs. baseline annealing (WSD): pretraining loss vs. pretraining tokens at different perturbation magnitudes γ for OLMo-60M. view at source ↗
Figure 61
Figure 61. Annealing with SAM vs. baseline annealing (WSD): pretraining loss vs. pretraining tokens at different perturbation magnitudes γ for OLMo-150M. view at source ↗
Figure 62
Figure 62. Annealing with SAM vs. baseline annealing (WSD) under 4-bit and 8-bit post-training quantization for OLMo-20M. view at source ↗
Figure 63
Figure 63. Annealing with SAM vs. baseline annealing (WSD) under 4-bit and 8-bit post-training quantization for OLMo-60M. view at source ↗
Figure 64
Figure 64. Annealing with SAM vs. baseline annealing (WSD) under 4-bit and 8-bit post-training quantization for OLMo-150M. view at source ↗
read the original abstract

Pretraining optimizers are tuned to produce the strongest possible base model, on the assumption that a stronger starting point yields a stronger model after subsequent changes like post-training and quantization. This overlooks the geometry of the base model which controls how much of the base model's capabilities survive subsequent parameter updates. We study three pretraining optimization approaches that bias optimization toward flatter minima: Sharpness-Aware Minimization (SAM), large learning rates, and shortened learning rate annealing periods. Across model sizes ranging from 20M to 150M parameters, we find that these interventions consistently improve downstream performance after post-training on five common datasets with up to 80% less forgetting. These principles hold at scale: a short SAM mid-training phase applied to an existing OLMo-2-1B checkpoint reduces forgetting by 31% after MetaMath post-training and by 40% after 4-bit quantization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that pretraining with Sharpness-Aware Minimization (SAM), large learning rates, or shortened annealing biases models toward flatter minima, which in turn reduces catastrophic forgetting during subsequent post-training and quantization. It reports consistent downstream gains across 20M–150M parameter models on five datasets (up to 80% less forgetting) and shows the approach scales to a 1B model, where a brief SAM mid-training phase cuts forgetting by 31% after MetaMath and 40% after 4-bit quantization.

Significance. If the central mechanism is isolated and the empirical gains replicate under standard controls, the result would be significant for LLM pretraining pipelines: it offers a low-cost way to improve base-model retention without changing post-training recipes, potentially affecting how large-scale checkpoints are produced and maintained.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (Methods): the claim that the three interventions 'bias optimization toward flatter minima' is presented without any reported sharpness measurements (Hessian trace, SAM sharpness, or perturbation robustness) on the resulting checkpoints, so it is impossible to confirm that geometry, rather than correlated optimizer effects, drives the retention gains.
  2. [§4 and Table 2] §4 (Experiments) and Table 2: the 80% forgetting reduction is reported without statistical tests, explicit baseline definitions, or ablation tables that hold effective regularization and loss-curve shape fixed while varying only flatness; the 1B-scale mid-training result similarly lacks a direct comparison of sharpness before and after the SAM phase.
  3. [§5] §5 (Scaling): the statement that 'these principles hold at scale' rests on a single 1B checkpoint experiment; without multiple runs, variance estimates, or controls for the original OLMo-2 training recipe, the 31% and 40% reductions cannot be confidently attributed to the SAM intervention alone.
minor comments (2)
  1. [Figure 1 and §2] Figure 1 caption and §2: the notation for 'forgetting' is introduced without an explicit equation; adding a short definition (e.g., ΔAcc = Acc_base − Acc_post) would improve readability.
  2. [References] References: several recent works on sharpness-aware training for LLMs (e.g., 2023–2024 SAM variants) are not cited; adding them would better situate the contribution.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional evidence and controls where feasible.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Methods): the claim that the three interventions 'bias optimization toward flatter minima' is presented without any reported sharpness measurements (Hessian trace, SAM sharpness, or perturbation robustness) on the resulting checkpoints, so it is impossible to confirm that geometry, rather than correlated optimizer effects, drives the retention gains.

    Authors: We agree that the original submission would have been strengthened by direct sharpness measurements to support the geometric interpretation. Although the three interventions are established in the literature as promoting flatter minima, we did not report explicit metrics such as Hessian trace or perturbation robustness. In the revised manuscript we have added these measurements (Hessian trace approximations and SAM sharpness) for the 20M–150M checkpoints in a new subsection of §3; the results confirm flatter landscapes and show correlation with the observed retention gains (a generic sketch of such a trace estimator follows these responses). For the 1B model we note the limitation below. revision: yes

  2. Referee: [§4 and Table 2] §4 (Experiments) and Table 2: the 80% forgetting reduction is reported without statistical tests, explicit baseline definitions, or ablation tables that hold effective regularization and loss-curve shape fixed while varying only flatness; the 1B-scale mid-training result similarly lacks a direct comparison of sharpness before and after the SAM phase.

    Authors: We accept that additional statistical rigor and controls are warranted. The revised §4 now includes paired statistical tests across random seeds for the reported forgetting reductions, clearer baseline definitions, and a new ablation table that holds regularization strength and loss-curve shape approximately fixed while varying the flatness-inducing factors. For the 1B mid-training result we have added a direct before/after sharpness comparison using the same metrics introduced in §3. revision: yes

  3. Referee: [§5] §5 (Scaling): the statement that 'these principles hold at scale' rests on a single 1B checkpoint experiment; without multiple runs, variance estimates, or controls for the original OLMo-2 training recipe, the 31% and 40% reductions cannot be confidently attributed to the SAM intervention alone.

    Authors: We acknowledge that a single 1B-scale experiment limits the strength of the scaling claim. The revised §5 now explicitly discusses this limitation, provides available variance estimates from the smaller-scale runs, and clarifies the controls relative to the original OLMo-2 recipe. We were unable to obtain multiple independent 1B pretraining runs. revision: partial
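
For concreteness, the "Hessian trace approximations" mentioned in the first response could be computed with a Hutchinson estimator, which needs only Hessian-vector products. A generic PyTorch sketch of that approach, an assumption about method rather than the authors' code:

    import torch

    def hutchinson_hessian_trace(loss, params, n_probes=8):
        # E[v^T H v] = tr(H) for Rademacher-distributed probe vectors v.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        est = 0.0
        for _ in range(n_probes):
            vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
            gv = sum((g * v).sum() for g, v in zip(grads, vs))
            # Hessian-vector product via a second backward pass.
            hvs = torch.autograd.grad(gv, params, retain_graph=True)
            est += sum((h * v).sum() for h, v in zip(hvs, vs)).item()
        return est / n_probes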

standing simulated objections not resolved
  • Obtaining multiple independent full 1B-scale pretraining runs (or additional 1B checkpoints) to supply variance estimates and stronger controls, which exceeds available computational resources.

Circularity Check

0 steps flagged

No circularity: purely empirical measurements of retention gains

full rationale

The paper reports experimental results comparing pretraining interventions (SAM, large LR, shortened annealing) on downstream forgetting after post-training and quantization. No mathematical derivation, first-principles prediction, or equation chain exists that could reduce to fitted inputs or self-citations. Claims rest on measured performance deltas across model scales (20M-150M and 1B), not on any self-referential definition or imported uniqueness theorem. Self-citations, if present, are not load-bearing for the central empirical finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical intervention study and introduces no new free parameters, axioms, or invented entities beyond standard assumptions of loss-landscape geometry in deep learning.

pith-pipeline@v0.9.0 · 5462 in / 1190 out tokens · 35862 ms · 2026-05-08T18:28:45.006460+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

82 extracted references · 29 canonical work pages · 11 internal anchors

  1. [1] Overtrained Language Models Are Harder to Fine-Tune. Forty-second International Conference on Machine Learning.
  2. [2] Wang, Xiaohan; Mao, Shengyu; Deng, Shumin; Yao, Yunzhi; Shen, Yue; Liang, Lei; Gu, Jinjie; Chen, Huajun; Zhang, Ningyu. Editing Conceptual Knowledge for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.40
  3. [3] De Cao, Nicola; Aziz, Wilker; Titov, Ivan. Editing Factual Knowledge in Language Models. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.522
  4. [4] Dohare, Shibhansh; Hernandez-Garcia, J. Fernando; Lan, Qingfeng; Rahman, Parash; Mahmood, A. Rupam; Sutton, Richard S. Loss of plasticity in deep continual learning. Nature. 2024. doi:10.1038/s41586-024-07711-7
  5. [5] Damian, Alex; Nichani, Eshaan; Lee, Jason D. Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability. The Eleventh International Conference on Learning Representations. 2023.
  6. [6] Bian, Ang; Li, Wei; Yuan, Hangjie; Yu, Chengrong; Wang, Mang; Zhao, Zixiang; Lu, Aojun; Ji, Pengliang; Feng, Tao. Make Continual Learning Stronger via C-Flat. doi:10.52202/079017-0244
  7. [7] Qwen2.5 Technical Report. 2025.
  8. [8] Olmo 3. 2026.
  9. [9] Joshi, Mandar; Choi, Eunsol; Weld, Daniel S.; Zettlemoyer, Luke. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. CoRR. 2017. arXiv:1705.03551
  10. [10] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. 2024.
  11. [11] AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. 2023.
  12. [12] Kwiatkowski, Tom; Palomaki, Jennimaria; Redfield, Olivia; Collins, Michael; Parikh, Ankur; Alberti, Chris; Epstein, Danielle; Polosukhin, Illia; Devlin, Jacob; Lee, Kenton; Toutanova, Kristina; Jones, Llion; Kelcey, Matthew; Chang, Ming-Wei; Dai, Andrew M.; Uszkoreit, Jakob; Le, Quoc; Petrov, Slav. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics. 2019.
  13. [13] Dua, Dheeru; Wang, Yizhong; Dasigi, Pradeep; Stanovsky, Gabriel; Singh, Sameer; Gardner, Matt. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. CoRR. 2019. arXiv:1903.00161
  14. [14] Sakaguchi, Keisuke; Le Bras, Ronan; Bhagavatula, Chandra; Choi, Yejin. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. CoRR. 2019. arXiv:1907.10641
  15. [15] Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. CoRR. 2020. arXiv:2009.03300
  16. [16] Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin. HellaSwag: Can a Machine Really Finish Your Sentence? CoRR. 2019. arXiv:1905.07830
  17. [17] Clark, Christopher; Lee, Kenton; Chang, Ming-Wei; et al. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. 2019. arXiv:1905.10044
  18. [18] Clark, Peter; Cowhey, Isaac; Etzioni, Oren; Khot, Tushar; Sabharwal, Ashish; Schoenick, Carissa; Tafjord, Oyvind. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. CoRR. 2018. arXiv:1803.05457
  19. [19] MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309.12284.
  20. [20] Training Dynamics Impact Post-Training Quantization Robustness. The Fourteenth International Conference on Learning Representations.
  21. [21] Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late in Training. 2025.
  22. [22] How does the optimizer implicitly bias the model merging loss landscape? The Fourteenth International Conference on Learning Representations.
  23. [23] Andriushchenko, Maksym; Varre, Aditya Vardhan; Pillaud-Vivien, Loucas; Flammarion, Nicolas. SGD with Large Step Sizes Learns Sparse Features. 2023.
  24. [24] Li, Yuanzhi; Wei, Colin; Ma, Tengyu. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. CoRR. 2019. arXiv:1907.04595
  25. [25] Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning. 2024.
  26. [26] Wen, Kaiyue; Li, Zhiyuan; Ma, Tengyu. Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization.
  27. [27] How Does Sharpness-Aware Minimization Minimize Sharpness? 2023.
  28. [28] Sharpness-Aware Training for Free. 2023.
  29. [29] Surrogate Gap Minimization Improves Sharpness-Aware Training. 2022.
  30. [30] Kwon, Jungmin; Kim, Jeongseop; Park, Hyunseo; Choi, In Kwon. ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks. CoRR. 2021. arXiv:2102.11600
  31. [31] Chaudhari, Pratik; Choromanska, Anna; Soatto, Stefano; LeCun, Yann; Baldassi, Carlo; Borgs, Christian; Chayes, Jennifer T.; Sagun, Levent; Zecchina, Riccardo. Entropy-SGD: Biasing Gradient Descent into Wide Valleys. CoRR. 2016. arXiv:1611.01838
  32. [32] Baldassi, Carlo; Pittorino, Fabrizio; Zecchina, Riccardo. Shaping the learning landscape in neural networks around wide flat minima. Proceedings of the National Academy of Sciences. 2020. doi:10.1073/pnas.1908636117
  33. [33] Fantastic Generalization Measures and Where to Find Them. International Conference on Learning Representations.
  34. [34] A Modern Look at the Relationship between Sharpness and Generalization. 2023.
  35. [35] Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models. 2022.
  36. [36] Jordan, Keller; Jin, Yuchen; Boza, Vlado; You, Jiacheng; Cesista, Franz; Newhouse, Laker; Bernstein, Jeremy. Muon: An optimizer for the hidden layers of neural networks. 2024.
  37. [37] Foret, Pierre; Kleiner, Ariel; Mobahi, Hossein; Neyshabur, Behnam. Sharpness-Aware Minimization for Efficiently Improving Generalization. International Conference on Learning Representations. 2021.
  38. [38] The Llama 3 Herd of Models. 2024.
  39. [39] Understanding Catastrophic Forgetting in Language Models via Implicit Inference. The Twelfth International Conference on Learning Representations.
  40. [40] Replaying pre-training data improves fine-tuning. 2026.
  41. [41] OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling. 2025.
  42. [42] Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. International Conference on Learning Representations.
  43. [43] Decoupled Weight Decay Regularization. International Conference on Learning Representations.
  44. [44] Hu, Shengding; Tu, Yuge; Han, Xu; Cui, Ganqu; He, Chaoqun; Zhao, Weilin; et al. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. 2024.
  45. [45] Scaling Laws for Precision. The Thirteenth International Conference on Learning Representations.
  46. [46] Kirkpatrick, James; Pascanu, Razvan; Rabinowitz, Neil; Veness, Joel; Desjardins, Guillaume; Rusu, Andrei A.; Milan, Kieran; Quan, John; Ramalho, Tiago; Grabska-Barwinska, Agnieszka; Hassabis, Demis; Clopath, Claudia; Kumaran, Dharshan; Hadsell, Raia. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526. 2017.
  47. [47] Continual Learning with Deep Generative Replay. 2017.
  48. [48] Episodic Memory in Lifelong Language Learning. 2019.
  49. [49] Bjorck, Johan; Benhaim, Alon; Chaudhary, Vishrav; Wei, Furu; Song, Xia. Scaling Optimal LR Across Token Horizons. 2025.
  50. [50] On Warm-Starting Neural Network Training. 2020.
  51. [51] The Primacy Bias in Deep Reinforcement Learning. 2022.
  52. [52] Hochreiter, Sepp; Schmidhuber, Jürgen. Flat Minima. Neural Computation. 1997. doi:10.1162/neco.1997.9.1.1
  53. [53] SGDR: Stochastic Gradient Descent with Warm Restarts. 2017.
  54. [54] A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning. 2023.
  55. [55] Midtraining Bridges Pretraining and Posttraining Distributions. 2026.
  56. [56] Gururangan, Suchin; Marasović, Ana; Swayamdipta, Swabha; Lo, Kyle; Beltagy, Iz; Downey, Doug; Smith, Noah A. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.740
  57. [57] Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn. Proceedings of the 42nd International Conference on Machine Learning. 2025.
  58. [58] Mehta, Sanket Vaibhav; Patil, Darshan; Chandar, Sarath; Strubell, Emma. An Empirical Investigation of the Role of Pre-training in Lifelong Learning. Journal of Machine Learning Research.
  59. [59] French, Robert M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences. 1999. doi:10.1016/S1364-6613(99)01294-2
  60. [60] Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir Yitzhak; et al. DataComp-LM: In search of the next generation of training sets for language models. 2024.
  61. [61] OLMo: Accelerating the Science of Language Models. 2024.
  62. [62] StarCoder: may the source be with you! 2023.
  63. [63] Training Verifiers to Solve Math Word Problems. 2021.
  64. [64] Zhang, Yifan. StackMathQA. 2024.
  65. [65] Tulu 3: Pushing Frontiers in Open Language Model Post-Training. 2025.
  66. [66] ChatMusician: Understanding and Generating Music Intrinsically with LLM. 2024.
  67. [67] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339.
  68. [68] QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314.
  69. [69] An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
  70. [70] Continual learning through synaptic intelligence. International Conference on Machine Learning. 2017.
  71. [71] Memory aware synapses: Learning what (not) to forget. Proceedings of the European Conference on Computer Vision (ECCV). 2018.
  72. [72] Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems. 2017.
  73. [73] Efficient Lifelong Learning with A-GEM. arXiv preprint arXiv:1812.00420.
  74. [74] Progressive Neural Networks. arXiv preprint arXiv:1606.04671.
  75. [75] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv preprint arXiv:1609.04836.
  76. [76] Sharp minima can generalize for deep nets. International Conference on Machine Learning. 2017.
  77. [77] Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192. 2024.
  78. [78] Robust fine-tuning of zero-shot models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  79. [79] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. International Conference on Machine Learning. 2022.
  80. [80] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323.

Showing first 80 references.