UFT: Unifying supervised and reinforcement fine-tuning
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates to ~50% in binary-reward GRPO, delivering 2.01x and 1.55x speedups on Qwen3-14B/32B with slight score improvements on SWE-bench Verified.
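
A minimal sketch of why a pass rate near 50% is plausibly the most informative regime for binary rewards. It is not the Prefix Sampling procedure itself, which this listing does not describe. It assumes independent rollouts that each pass with probability p, and a GRPO-style group-normalized advantage that is zero whenever every rollout in a group receives the same reward. The function names and the group size of 8 are hypothetical, chosen only for illustration.

```python
# Illustrative only: this is NOT the Prefix Sampling algorithm.
# Assumption: each rollout passes independently with probability p.


def reward_variance(p: float) -> float:
    """Variance of a binary (Bernoulli) reward with pass rate p."""
    return p * (1.0 - p)


def mixed_group_probability(p: float, group_size: int) -> float:
    """Probability that a group contains at least one pass and one fail.

    Groups where all rewards are identical give zero group-normalized
    advantage, so only mixed groups contribute a learning signal.
    """
    all_pass = p ** group_size
    all_fail = (1.0 - p) ** group_size
    return 1.0 - all_pass - all_fail


if __name__ == "__main__":
    group_size = 8  # hypothetical group size
    for p in (0.05, 0.25, 0.5, 0.75, 0.95):
        var = reward_variance(p)
        mixed = mixed_group_probability(p, group_size)
        print(f"p={p:.2f}  variance={var:.4f}  P(mixed group)={mixed:.4f}")
```

Under these assumptions both quantities peak at p = 0.5, which is consistent with the ~50% target stated in the summary above.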