arXiv: 2605.28860 · v2 · pith:OTXUMK3Onew · submitted 2026-05-21 · 💻 cs.LG · cs.AI · cs.CL

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

Pith reviewed 2026-06-30 17:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CL
keywords catastrophic forgetting, reinforcement learning, supervised fine-tuning, circuit vulnerability, large language models, fine-tuning, attention heads, mechanistic interpretability

The pith

Reinforcement learning preserves more of a language model's original circuits than supervised fine-tuning during task adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares reinforcement learning and supervised fine-tuning to understand why the former resists catastrophic forgetting better in large language models. It introduces differential circuit vulnerability as a head-level metric that tracks how much fine-tuning alters specific internal circuits. Experiments on adapting Qwen2.5-3B-Instruct to scientific question answering show SFT reaches target performance faster yet alters circuits more and erases prior capabilities, while RL changes circuits less at the expense of slower adaptation. The work concludes that greater circuit preservation under RL helps account for its reduced forgetting.

Core claim

SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting.

What carries the argument

Differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning.
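
This page does not reproduce the metric's formula. As a minimal sketch, assuming a circuit can be represented as a set of (layer, head) pairs (the figure captions on this page count heads shared between the base, SFT, and RL circuits), base-circuit retention can be read as the share of base heads that are still present after fine-tuning. The function name `retention` and the toy head sets are hypothetical and are not taken from the paper or its code.

```python
from typing import Set, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical representation


def retention(base: Set[Head], tuned: Set[Head]) -> float:
    """Fraction of base-circuit heads that are still in the fine-tuned circuit."""
    if not base:
        raise ValueError("base circuit must contain at least one head")
    return len(base & tuned) / len(base)


# Toy circuits, invented for illustration only.
base_circuit = {(0, 1), (0, 3), (1, 2), (2, 0)}
sft_circuit = {(0, 1), (1, 2), (3, 4)}          # keeps 2 of 4 base heads
rl_circuit = {(0, 1), (0, 3), (1, 2), (3, 4)}   # keeps 3 of 4 base heads

sft_retention = retention(base_circuit, sft_circuit)
rl_retention = retention(base_circuit, rl_circuit)
print(f"SFT retention: {sft_retention:.2f}")
print(f"RL retention:  {rl_retention:.2f}")
```

Under this reading, a lower retention value for one training method would correspond to a larger share of the base circuit being displaced, which is the kind of gap the retention percentages in the figure captions describe.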

If this is right

  • RL updates remain closer to the base policy, resulting in smaller circuit shifts and better retention of earlier skills.
  • Faster task gains under SFT come with higher circuit disruption that directly increases loss of prior capabilities.
  • Circuit preservation serves as a mechanistic factor distinguishing the forgetting behavior of RL from SFT.
  • The observed speed-versus-stability trade-off applies specifically to the head-level circuits measured in the adaptation task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fine-tuning recipes could be evaluated or designed by tracking circuit vulnerability to favor retention when needed.
  • The metric offers a way to compare other adaptation methods beyond RL and SFT on the same mechanistic axis.
  • If circuit changes prove causal for capability loss, interventions that limit vulnerability could reduce forgetting without slowing adaptation.

Load-bearing premise

The differential circuit vulnerability metric validly quantifies degradation of the computational circuits responsible for prior model capabilities rather than unrelated changes.

What would settle it

Observing no meaningful difference in differential circuit vulnerability between RL and SFT runs despite RL exhibiting clearly less forgetting would undermine the proposed mechanistic link.

Figures

Figures reproduced from arXiv: 2605.28860 by Jeanmely Rojas Nunez, Maheep Chaudhary, Nathan Allen, Nomgondalai Amgalanbaatar, Vasu Sharma, Viraj Sawant, Yannis Zongo.

Figure 1
Figure 1: Circuit retention trajectories during high-NTS training. Starting from 100% base-circuit retention, SFT (orange) and RL (blue) diverge sharply over the two training epochs that produce the high new-task score models. SFT drops to 63.5% after Epoch 1 and continues declining to 59.0% by Epoch 2, whereas RL falls to 69.8% after Epoch 1 and recovers to 72.5% by Epoch 2, a 13.5 percentage-point advantage. Foote…
Figure 2
Figure 2: Performance–preservation trade-off across NT levels. SFT (dashed) exhibits a sharp preservation drop in the high-NTS regime, while RL (solid) declines gradually and preserves 15.8 percentage points more of the base circuit at peak new-task performance.
[Plot residue following this caption: axes Sufficiency (Score), 0.1 to 0.8, and Necessity (Ablation Drop), 0.0 to 1.0; legend: Critical Specialists, Base, RL, SFT.]
Figure 3
Figure 3: Head Role Distribution Under Base, Supervised, and RL Training. In our setup, SFT produces a cluster of “Critical Specialists” (heads with high necessity and sufficiency), while RL maintains a distributed architecture that overlaps closely with the base model, avoiding the structural compression and specialization observed under the supervised objective.
Figure 4
Figure 4: Attention Head Overlap Between Base, SFT, and RL. The plot shows the circuit overlap study for the base, SFT, and RL models. The bars reflect the number of attention heads that are unique to one model, shared by two models, or present at all three levels of training. The gap between the SFT and RL circuit sizes (∼265 vs. ∼295 heads), depicted in …
Figure 5
Figure 5: Layer-wise Circuit Retention in RL. Our RL model shows architectural stability across all 36 transformer layers, with a high count of retained heads and relatively few forgotten components throughout the network depth.
[Plot residue following this caption: x-axis Transformer Layer (0 to 35), y-axis Number of Heads; panel title "Base Components Retained vs. Forgotten in SFT B…".]
Figure 6
Figure 6: Layer-wise Circuit Retention in SFT. Our SFT model shows broader structural change, with forgotten heads scattered throughout all layers and higher concentrations in the mid-to-late transformer layers.
Figure 7
Figure 7: Per-Head Necessity vs. Δm_h: SFT vs. RL. We plot per-head necessity against Δm_h (Eq. 4) for both models. The absence of a positive correlation under either objective indicates that mask shifts are not driven by head necessity alone. SFT exhibits a weak negative trend (r=−0.125), suggesting it suppresses heads irrespective of their functional role, while RL’s flat relationship (r=0.022) is consistent with it…
Figure 8
Figure 8: This graph illustrates the number of shared (overlapping) heads among the 'Base', 'SFT' (Supervised Fine-Tuning), and 'RL' (Reinforcement Learning) circuits. Diagonal elements (such as Base-Base, SFT-SFT, and RL-RL) indicate the full size (number of heads) of each particular circuit. Off-diagonal elements (e.g., Base-SFT, SFT-RL) represent the number of heads shared between two separate circuits. For exa…
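
The Figure 7 caption summarizes the relation between per-head necessity and the per-head mask shift with a coefficient r. As a minimal sketch, assuming r is the Pearson correlation coefficient (the caption does not say which coefficient is used), it could be computed as below. The helper `pearson_r` and the five sample values are hypothetical and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-head values, for illustration only.
necessity = [0.1, 0.4, 0.2, 0.8, 0.5]
mask_shift = [0.3, 0.1, 0.4, 0.2, 0.3]

r = pearson_r(necessity, mask_shift)
print(f"r = {r:+.3f}")
```

A value of r near zero, as the caption reports for RL, would mean that knowing a head's necessity says little about how much its mask changed.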

Original abstract

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy (Shenfeld et al., 2025). We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that RL fine-tuning of LLMs preserves prior capabilities better than SFT because it induces less disruption to internal computational circuits. Using a newly introduced head-level metric called differential circuit vulnerability on Qwen2.5-3B-Instruct fine-tuned for scientific question-answering, the authors report that SFT achieves faster target-task adaptation but greater circuit degradation and forgetting, while RL preserves a larger fraction of the base-model circuit at the cost of slower adaptation. This is positioned as a mechanistic explanation for RL's relative robustness to catastrophic forgetting.

Significance. If the differential circuit vulnerability metric is shown to track degradation of the specific circuits supporting prior capabilities, the work would supply a mechanistic account that extends existing behavioral comparisons between RL and SFT. The public release of code is a clear strength for reproducibility.

major comments (1)
  1. [Definition of differential circuit vulnerability (methods)] The central claim equates higher differential circuit vulnerability under SFT with greater degradation of the circuits responsible for prior capabilities. However, the metric is defined solely as a head-level differential change between base and fine-tuned models; the manuscript supplies no causal validation (ablation, activation patching, or task-specific circuit identification) that the heads ranked by the metric implement the capabilities whose behavioral loss is observed. Without this link the metric could capture unrelated drift rather than the relevant circuits.
minor comments (1)
  1. [Abstract] The abstract states the main findings without experimental details, controls, statistical tests, or sample sizes, which hinders immediate evaluation of the reported trade-off.
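
The major comment asks for causal validation such as ablation. As a minimal sketch of what a head-level ablation measures, assume a toy model whose task score is simply the sum of per-head contributions (real transformers are not additive in this way). Necessity is then the score lost when one head is removed, matching the "Necessity (Ablation Drop)" axis label that appears in the figures section. The names `task_score` and `necessity` and all numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical representation

# Toy per-head contributions to a task score; invented numbers.
contributions: Dict[Head, float] = {
    (0, 0): 0.05,
    (0, 1): 0.40,
    (1, 0): 0.25,
    (1, 1): 0.10,
}


def task_score(ablated: Optional[Head] = None) -> float:
    """Score of the toy model with at most one head zeroed out."""
    return sum(v for h, v in contributions.items() if h != ablated)


def necessity(head: Head) -> float:
    """Ablation drop: score lost when `head` is removed."""
    return task_score() - task_score(ablated=head)


# Rank heads from most to least necessary for the toy task.
ranking: List[Head] = sorted(contributions, key=necessity, reverse=True)
for head in ranking:
    print(head, round(necessity(head), 2))
```

In an actual validation, the score would come from evaluating the model on a prior-capability benchmark with the chosen head's output zeroed or replaced, and the resulting ranking could be compared against the heads that the vulnerability metric flags.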

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address the major comment below.

Point-by-point responses
  1. Referee: [Definition of differential circuit vulnerability (methods)] The central claim equates higher differential circuit vulnerability under SFT with greater degradation of the circuits responsible for prior capabilities. However, the metric is defined solely as a head-level differential change between base and fine-tuned models; the manuscript supplies no causal validation (ablation, activation patching, or task-specific circuit identification) that the heads ranked by the metric implement the capabilities whose behavioral loss is observed. Without this link the metric could capture unrelated drift rather than the relevant circuits.

    Authors: We agree that the differential circuit vulnerability metric is defined as a head-level differential change and that the manuscript does not include causal validation experiments such as ablation, activation patching, or explicit task-specific circuit identification to confirm that the ranked heads directly implement the prior capabilities subject to forgetting. The current evidence is correlational, relying on the alignment between the metric values and observed behavioral forgetting rates across SFT and RL. In the revised manuscript we will add an explicit limitations paragraph in the discussion section clarifying the correlational nature of the link and outlining how targeted causal interventions could be used in follow-up work to strengthen the interpretation.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical metric applied to independent training runs

full rationale

The paper introduces differential circuit vulnerability as a new head-level metric and applies it to observed differences between base, SFT, and RL fine-tuned models on Qwen2.5-3B-Instruct. The central comparison (SFT disrupts more than RL) rests on direct measurement of this metric across training runs rather than any self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. The cited prior work (Shenfeld et al., 2025) addresses behavioral robustness and is not used to justify the metric or force the mechanistic conclusion. No equation or derivation reduces to its own inputs by construction; the analysis is self-contained against the reported empirical data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger constructed from abstract alone; full paper may contain additional fitted parameters or assumptions.

axioms (1)
  • domain assumption: Attention heads correspond to distinct computational circuits whose degradation can be tracked via activation changes
    The vulnerability metric is defined at the head level.
invented entities (1)
  • differential circuit vulnerability (no independent evidence)
    purpose: Head-level scalar measuring circuit degradation under fine-tuning
    Newly defined to enable the RL versus SFT comparison.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

    cs.LG 2026-06 unverdicted novelty 5.0

    Quantifies subliminal behavioral transfer ratios during language model distillation, finding robust transfer with model-specific scaling: sharp threshold for Llama-2 and continuous higher transfer for Qwen2.5.

  2. Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

    cs.LG 2026-06 unverdicted novelty 5.0

    Steering Llama-2-7B-Chat and Qwen2.5-7B-Instruct teachers and distilling students on benign data transfers measurable jailbreak susceptibility, with Llama showing threshold behavior at α = -0.15 and Qwen reaching tran...

Reference graph

Works this paper leans on

17 extracted references · 5 linked inside Pith · cited by 1 Pith paper

  3. [3] Chaudhary, M. and Geiger, A. Evaluating open-source sparse autoencoders on disentangling factual knowledge in gpt-2 small. arXiv preprint arXiv:2409.04478, 2024

  4. [4] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis...

  5. [5] Davies, X., Nadeau, M., Prakash, N., Shaham, T. R., and Bau, D. Discovering variable binding circuitry with desiderata. arXiv preprint arXiv:2310.02336, 2023

  6. [6] Feng, K., Shen, X., Wang, W., Zhuang, X., Tang, Y., Zhang, Q., and Ding, K. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint, 2025

  7. [7] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021

  8. [8] Hu, J., Zhang, Y., Han, Q., Jiang, D., Zhang, X., and Shum, H.-Y. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint, 2025

  9. [9] Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2022

  10. [10] Prakash, N., Shaham, T. R., Haklay, T., Belinkov, Y., and Bau, D. Fine-tuning enhances existing mechanisms: A case study on entity tracking. In International Conference on Learning Representations (ICLR), 2024

  11. [11] Sakaguchi, K., Le Bras, R., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. In AAAI Conference on Artificial Intelligence, 2020

  12. [12] Shenfeld, I., Pari, J., and Agrawal, P. Rl's razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025

  13. [13] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics (ACL), 2019

  14. [14] Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023
