pith. machine review for the scientific record.

arxiv: 2509.12235 · v3 · submitted 2025-09-08 · 💻 cs.LG · cs.AI

Recognition: unknown

RL Fine-Tuning Heals OOD Forgetting in SFT

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI
original abstract

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) is a standard post-training recipe for improving Large Language Model (LLM) reasoning, but why it works remains unclear. We revisit the common claim that "SFT memorizes, RL generalizes" through checkpoint-wise analyses of in-distribution (ID) and out-of-distribution (OOD) reasoning. We find that OOD performance often peaks early during SFT and then declines despite continued improvement in ID reasoning. RL typically does not surpass this early SFT peak; rather, it restores OOD capability lost during later SFT, and only from a bounded range of SFT checkpoints. Further spectral analysis shows that this forgetting-and-recovery pattern correlates with rotations of singular vectors, while singular values remain largely stable. These findings suggest a more precise view of post-training dynamics: SFT can forget, RL can recover, and controlling singular-vector rotation may improve OOD robustness. Code is available at https://github.com/jinhangzhan/RL_Heals_SFT.
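
The spectral claim is concrete enough to illustrate. Below is a minimal sketch, assuming PyTorch, of how one could compare two checkpoints of the same weight matrix: track drift in the top-k singular values alongside the overlap of the top-k left singular subspaces. Function and variable names are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of the kind of spectral analysis the abstract describes:
# comparing singular values and singular-vector subspaces of one weight
# matrix across two fine-tuning checkpoints. Names are illustrative only.
import torch

def spectral_drift(w_before: torch.Tensor, w_after: torch.Tensor, k: int = 32):
    """Compare top-k singular values and left singular subspaces of two
    checkpoints of the same weight matrix."""
    u0, s0, _ = torch.linalg.svd(w_before, full_matrices=False)
    u1, s1, _ = torch.linalg.svd(w_after, full_matrices=False)

    # Relative drift of the top-k singular values (magnitude of the spectrum).
    sv_drift = ((s1[:k] - s0[:k]).norm() / s0[:k].norm()).item()

    # Overlap of the top-k left singular subspaces: the squared Frobenius
    # norm of U0^T U1 equals k when the subspaces coincide, 0 when orthogonal.
    overlap = (u0[:, :k].T @ u1[:, :k]).pow(2).sum().item() / k

    return sv_drift, overlap

# Toy "checkpoints": an orthogonal transform close to the identity rotates
# the left singular vectors while leaving the singular values unchanged,
# so sv_drift stays near zero while the subspace overlap drops below 1.
w0 = torch.randn(512, 512)
q, _ = torch.linalg.qr(torch.eye(512) + 0.05 * torch.randn(512, 512))
w1 = q @ w0
print(spectral_drift(w0, w1))
```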

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    ControBench is a new interaction-aware benchmark combining heterogeneous graphs and rich text for controversial discourse analysis on social networks.

  2. Rotation-Preserving Supervised Fine-Tuning

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions; a sketch of this idea appears after this list.

  3. Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    cs.LG · 2026-05 · unverdicted · novelty 4.0

    Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
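
The second cited paper's one-line summary suggests a concrete mechanism. Below is a hypothetical sketch, assuming a standard PyTorch training loop, of what a penalty on rotating away from pretrained singular subspaces could look like; the function and names are illustrative readings of the summary, not RPSFT's actual method.

```python
# Hypothetical sketch of a rotation penalty on pretrained singular subspaces,
# in the spirit of the RPSFT summary above. All names here are illustrative.
import torch

def subspace_rotation_penalty(w: torch.Tensor, u_pre: torch.Tensor) -> torch.Tensor:
    """Penalize rotation of the current weight's top singular subspace away
    from the pretrained top-k left singular vectors u_pre (d x k, frozen)."""
    k = u_pre.shape[1]
    u, _, _ = torch.linalg.svd(w, full_matrices=False)
    u_k = u[:, :k]
    # Projection residual: zero when span(u_k) == span(u_pre),
    # and it grows as the current subspace rotates away.
    residual = u_k - u_pre @ (u_pre.T @ u_k)
    return residual.pow(2).sum() / k

# During SFT one would add this to the task loss, e.g.:
#   loss = task_loss + lam * subspace_rotation_penalty(layer.weight, u_pre_frozen)
```

Differentiating through the SVD is expensive for large layers, so a practical variant might apply the penalty to a subset of layers or to cached projectors; again, this is a sketch under stated assumptions, not the cited paper's implementation.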