RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds a clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.
ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
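The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that low-probability tokens are masked, so the fixed `threshold`, the function name `masked_sft_loss`, the averaging over kept tokens, and the assumption that probabilities are the model's own probabilities for the reference tokens are all hypothetical choices made for illustration.

```python
import math


def masked_sft_loss(token_probs, threshold=0.5):
    """Mean negative log-likelihood over reference tokens that are kept.

    token_probs: probability assigned to each reference token (assumed
                 to come from the model; the abstract does not specify).
    threshold:   hypothetical cutoff; tokens below it are masked out.
    Returns (loss, mask), where mask[i] is True if token i contributes.
    """
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")

    # Keep high-probability tokens, mask low-probability ones.
    mask = [p >= threshold for p in token_probs]
    kept = [p for p, keep in zip(token_probs, mask) if keep]

    # If every token is masked there is nothing to train on.
    if not kept:
        return 0.0, mask

    loss = -sum(math.log(p) for p in kept) / len(kept)
    return loss, mask


# Example: the second and fourth tokens fall below the cutoff and are ignored.
example_probs = [0.9, 0.1, 0.8, 0.05]
example_loss, example_mask = masked_sft_loss(example_probs, threshold=0.5)
```

In this sketch the two low-probability tokens contribute nothing to the loss, so the gradient would come only from the tokens the abstract describes as carrying the core logical framework. A real training loop would apply the same mask to per-token cross-entropy terms rather than to a plain list of probabilities.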
years
2026: 6
verdicts
UNVERDICTED: 6
representative citing papers
AlphaToken decouples adaptation and stability into path-aware token valuations for LLM post-training using a Fisher-drift proxy to mask low-value tokens and improve performance while reducing catastrophic forgetting.
PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting at least 5.10 point gains in Safety Avg. over baselines on 1B-8B models.
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.
citing papers explorer
- AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training
AlphaToken decouples adaptation and stability into path-aware token valuations for LLM post-training using a Fisher-drift proxy to mask low-value tokens and improve performance while reducing catastrophic forgetting.
- PriFT: Prior-Support Guided Supervised Fine-Tuning
PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.
- Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.