ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection

2026 · cs.CL · arXiv 2601.09195

6 Pith papers cite this work. Polarity classification is still indexing.



abstract

Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent. However, traditional SFT often ignores the one-to-many nature of language by forcing alignment with a single reference answer, leading to the model overfitting to non-core expressions. Although our empirical analysis suggests that introducing multiple reference answers can mitigate this issue, the prohibitive data and computational costs necessitate a strategic shift: prioritizing the mitigation of single-reference overfitting over the costly pursuit of answer diversity. To achieve this, we reveal the intrinsic connection between token probability and semantic importance: high-probability tokens carry the core logical framework, while low-probability tokens are mostly replaceable expressions. Based on this insight, we propose ProFit, which selectively masks low-probability tokens to prevent surface-level overfitting. Extensive experiments confirm that ProFit consistently outperforms traditional SFT baselines on general reasoning and mathematical benchmarks.
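The masking idea in the abstract can be illustrated with a minimal sketch. It assumes that the model's probability for each reference token is already available and that "low-probability" means falling below a fixed cutoff. The function name `masked_sft_loss`, the `threshold` parameter, and its default value are hypothetical; the abstract does not state the exact selection rule ProFit uses.

```python
import math


def masked_sft_loss(token_logprobs, threshold=0.5):
    """Average negative log-likelihood over the tokens that are kept.

    token_logprobs: log p(reference token | prefix) for each position of
        one reference answer, as scored by the model being fine-tuned.
    threshold: hypothetical probability cutoff (an assumption, not a value
        taken from the paper). Tokens below it are masked out of the loss.

    Returns (loss, mask), where mask[i] is True if token i contributes.
    """
    # Keep a token only if the model assigns it at least `threshold` probability.
    mask = [math.exp(lp) >= threshold for lp in token_logprobs]
    kept = [lp for lp, keep in zip(token_logprobs, mask) if keep]
    if not kept:
        # Every token was masked, so there is nothing to train on here.
        return 0.0, mask
    loss = -sum(kept) / len(kept)
    return loss, mask


# Toy example: five reference tokens with made-up probabilities.
probs = [0.9, 0.05, 0.8, 0.6, 0.1]
logprobs = [math.log(p) for p in probs]

loss, mask = masked_sft_loss(logprobs, threshold=0.5)
plain_loss = -sum(logprobs) / len(logprobs)  # standard SFT averages over all tokens
```

In this toy case the two low-probability tokens are excluded, so the masked loss is smaller than the standard SFT loss, which is dominated by exactly those tokens. In a real training loop the same mask would typically be applied to the per-token cross-entropy before averaging.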

representative citing papers

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds a clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.

AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

AlphaToken decouples adaptation and stability into path-aware token valuations for LLM post-training using a Fisher-drift proxy to mask low-value tokens and improve performance while reducing catastrophic forgetting.

PriFT: Prior-Support Guided Supervised Fine-Tuning

cs.CL · 2026-06-08 · unverdicted · novelty 5.0

PriFT uses token reweighting signals from a frozen pretrained model to stabilize SFT and achieve better results than standard SFT baselines on reasoning tasks.

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

cs.LG · 2026-06-07 · unverdicted · novelty 5.0

Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

DualSelect couples task and reference selection via a minimax framework with entropy-regularized scoring to preserve safety in LLM fine-tuning, reporting gains of at least 5.10 points in Safety Avg. over baselines on 1B-8B models.

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

cs.CL · 2026-05-30 · unverdicted · novelty 4.0

DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · none · ref 3 · internal anchor

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

cs.LG · 2026-06-07 · unverdicted · none · ref 23 · internal anchor

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

cs.LG · 2026-06-01 · unverdicted · none · ref 55 · internal anchor
