pith. machine review for the scientific record.

arxiv: 2509.12235 · v3 · submitted 2025-09-08 · 💻 cs.LG · cs.AI

Recognition: unknown

RL Fine-Tuning Heals OOD Forgetting in SFT

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI
original abstract

Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) is a standard post-training recipe for improving Large Language Model (LLM) reasoning, but why it works remains unclear. We revisit the common claim that "SFT memorizes, RL generalizes" through checkpoint-wise analyses of in-distribution (ID) and out-of-distribution (OOD) reasoning. We find that OOD performance often peaks early during SFT and then declines despite continued improvement in ID reasoning. RL typically does not surpass this early SFT peak; rather, it restores OOD capability lost during later SFT, and only from a bounded range of SFT checkpoints. Further spectral analysis shows that this forgetting-and-recovery pattern correlates with rotations of singular vectors, while singular values remain largely stable. These findings suggest a more precise view of post-training dynamics: SFT can forget, RL can recover, and controlling singular-vector rotation may improve OOD robustness. Code is available at https://github.com/jinhangzhan/RL_Heals_SFT.
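
The spectral claim is concrete enough to illustrate. Below is a minimal sketch, assuming PyTorch, of how one could compare two checkpoints of the same weight matrix: track drift in the top-k singular values alongside the overlap of the top-k left singular subspaces. Function and variable names are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of the kind of spectral analysis the abstract describes:
# comparing singular values and singular-vector subspaces of one weight
# matrix across two fine-tuning checkpoints. Names are illustrative only.
import torch

def spectral_drift(w_before: torch.Tensor, w_after: torch.Tensor, k: int = 32):
    """Compare top-k singular values and left singular subspaces of two
    checkpoints of the same weight matrix."""
    u0, s0, _ = torch.linalg.svd(w_before, full_matrices=False)
    u1, s1, _ = torch.linalg.svd(w_after, full_matrices=False)

    # Relative drift of the top-k singular values (magnitude of the spectrum).
    sv_drift = ((s1[:k] - s0[:k]).norm() / s0[:k].norm()).item()

    # Overlap of the top-k left singular subspaces: the squared Frobenius
    # norm of U0^T U1 equals k when the subspaces coincide, 0 when orthogonal.
    overlap = (u0[:, :k].T @ u1[:, :k]).pow(2).sum().item() / k

    return sv_drift, overlap

# Toy "checkpoints": an orthogonal transform close to the identity rotates
# the left singular vectors while leaving the singular values unchanged,
# so sv_drift stays near zero while the subspace overlap drops below 1.
w0 = torch.randn(512, 512)
q, _ = torch.linalg.qr(torch.eye(512) + 0.05 * torch.randn(512, 512))
w1 = q @ w0
print(spectral_drift(w0, w1))
```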

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    ControBench is a new interaction-aware benchmark combining heterogeneous graphs and rich text for controversial discourse analysis on social networks.

  2. Rotation-Preserving Supervised Fine-Tuning

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions; a sketch of this idea appears after this list.

  3. Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

    cs.LG · 2026-05 · unverdicted · novelty 4.0

    Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
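
The second cited paper's one-line summary suggests a concrete mechanism. Below is a hypothetical sketch, assuming a standard PyTorch training loop, of what a penalty on rotating away from pretrained singular subspaces could look like; the function and names are illustrative readings of the summary, not RPSFT's actual method.

```python
# Hypothetical sketch of a rotation penalty on pretrained singular subspaces,
# in the spirit of the RPSFT summary above. All names here are illustrative.
import torch

def subspace_rotation_penalty(w: torch.Tensor, u_pre: torch.Tensor) -> torch.Tensor:
    """Penalize rotation of the current weight's top singular subspace away
    from the pretrained top-k left singular vectors u_pre (d x k, frozen)."""
    k = u_pre.shape[1]
    u, _, _ = torch.linalg.svd(w, full_matrices=False)
    u_k = u[:, :k]
    # Projection residual: zero when span(u_k) == span(u_pre),
    # and it grows as the current subspace rotates away.
    residual = u_k - u_pre @ (u_pre.T @ u_k)
    return residual.pow(2).sum() / k

# During SFT one would add this to the task loss, e.g.:
#   loss = task_loss + lam * subspace_rotation_penalty(layer.weight, u_pre_frozen)
```

Differentiating through the SVD is expensive for large layers, so a practical variant might apply the penalty to a subset of layers or to cached projectors; again, this is a sketch under stated assumptions, not the cited paper's implementation.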