Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
For Math and Science Q&A, accuracy was measured by comparing the model’s final answer to the ground truth, ignoring intermediate reasoning chains.
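The final-answer scoring described above can be illustrated with a minimal sketch. The `Answer:` marker, the normalization rules, and the function names `extract_final_answer`, `normalize`, and `final_answer_accuracy` are assumptions made for illustration; the source does not specify how answers were extracted or matched.

```python
import re
from typing import Sequence


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' marker (an assumed convention).

    Falls back to the last non-empty line when no marker is present.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    if matches:
        return matches[-1].strip()
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim, drop trailing periods and spaces, lowercase."""
    return answer.strip().rstrip(".").replace(" ", "").lower()


def final_answer_accuracy(responses: Sequence[str], ground_truths: Sequence[str]) -> float:
    """Fraction of responses whose final answer matches the ground truth.

    Intermediate reasoning is never inspected; only the extracted answer counts.
    """
    if len(responses) != len(ground_truths):
        raise ValueError("responses and ground_truths must have the same length")
    if not responses:
        return 0.0
    correct = sum(
        normalize(extract_final_answer(r)) == normalize(g)
        for r, g in zip(responses, ground_truths)
    )
    return correct / len(responses)


if __name__ == "__main__":
    responses = [
        "First, 2 + 2 = 4.\nAnswer: 4",
        "I think 3 * 3 = 6, wait, it is 9.\nAnswer: 9.",
        "Some reasoning here.\nAnswer: 7",
    ]
    ground_truths = ["4", "9", "8"]
    print(final_answer_accuracy(responses, ground_truths))
```

In this sketch the second response counts as correct even though its reasoning contains a mistake, because only the extracted final answer is compared with the ground truth.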
1 Pith paper cites this work (field: cs.LG; year: 2025; verdict: unverdicted):
RL's Razor: Why Online Reinforcement Learning Forgets Less