
arxiv: 2602.01058 · v2 · pith:CRXVDAUCnew · submitted 2026-02-01 · 💻 cs.LG · cs.AI · cs.CL

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

keywords: offline, pear, learning, reasoning, better, data, holistic, initialized

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical of current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks with Qwen 2.5, Qwen 3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training, in which SFT is designed and evaluated with downstream RL in mind rather than in isolation.
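To make the idea of importance-sampling reweighting at token, block, and sequence granularity more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the abstract does not give the exact weights, so the ratio direction (current policy over data-generating distribution), the fixed `block_size`, the `clip` threshold, and all function names are assumptions made for illustration only.

```python
import math


def importance_weights(policy_logp, behavior_logp, level="token",
                       block_size=2, clip=5.0):
    """Return one importance weight per token (hypothetical helper).

    policy_logp / behavior_logp: per-token log-probabilities of the same
    offline sequence under the model being trained and under the
    distribution that generated the data (collected ahead of time).
    """
    log_ratio = [p - b for p, b in zip(policy_logp, behavior_logp)]
    n = len(log_ratio)
    if level == "token":
        groups = [[i] for i in range(n)]
    elif level == "block":
        groups = [list(range(s, min(s + block_size, n)))
                  for s in range(0, n, block_size)]
    elif level == "sequence":
        groups = [list(range(n))]
    else:
        raise ValueError(f"unknown level: {level}")

    weights = [0.0] * n
    for group in groups:
        # Product of per-token ratios within the group, computed in log space.
        w = math.exp(sum(log_ratio[i] for i in group))
        w = min(w, clip)  # assumed clipping, to keep the variance bounded
        for i in group:
            weights[i] = w
    return weights


def reweighted_sft_loss(policy_logp, behavior_logp, **kwargs):
    """Importance-weighted mean negative log-likelihood over tokens.

    In a real training loop the weights would be treated as constants
    (no gradient through them); here everything is plain floats.
    """
    weights = importance_weights(policy_logp, behavior_logp, **kwargs)
    total = sum(w * -lp for w, lp in zip(weights, policy_logp))
    return total / len(policy_logp)


# Toy sequence of four tokens with made-up probabilities.
policy = [math.log(0.5), math.log(0.5), math.log(0.8), math.log(0.4)]
behavior = [math.log(0.25), math.log(0.5), math.log(0.8), math.log(0.8)]
for level in ("token", "block", "sequence"):
    weights = importance_weights(policy, behavior, level=level)
    print(level, [round(w, 3) for w in weights])
```

When the two distributions agree on every token, all weights equal 1 and the loss reduces to the ordinary SFT negative log-likelihood. Coarser granularities multiply more ratios together, which is why some form of variance control (here, a simple clip) is assumed.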



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

    cs.LG 2026-06 unverdicted novelty 6.0

    RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds a clipped reference-policy ratio to limit drift, reporting better results than sta...

  2. Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning

    cs.LG 2026-02 unverdicted novelty 6.0

    Task information structure determines ML scaling success: code's dense, verifiable signals enable predictable progress, whereas sparse-feedback tasks such as typical RL do not.

  3. SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

    cs.LG 2026-06 unverdicted novelty 5.0

    SFT depth increases pre-RL pass@1 but can cause entropy collapse that inverts GRPO outcomes on Qwen models via reduced group advantage variance.

  4. PriFT: Prior-Support Guided Supervised Fine-Tuning

    cs.CL 2026-06 unverdicted novelty 5.0

    PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.

  5. When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

    cs.LG 2026-06 unverdicted novelty 5.0

    Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.