Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
arXiv preprint arXiv:2505.16984 , year =
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7roles
background 1polarities
background 1representative citing papers
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
ARMOR optimizes retrievers via joint RAG-likelihood and InfoNCE training with regularization toward the base encoder, yielding improved retrieval and QA on telecom benchmarks.
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
WavAlign introduces an adaptive hybrid post-training recipe that makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring and regulating their mixture to yield better semantic quality and expressiveness.
EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.
citing papers explorer
-
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.