arXiv preprint arXiv:2505.16984 , year =

Mingyang Liu, Gabriele Farina, Asuman E · 2025 · arXiv 2505.16984

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

cs.LG · 2025-07-02 · unverdicted · novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

cs.LG · 2026-04-14 · unverdicted · novelty 6.0 · 2 refs

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

cs.LG · 2026-03-11 · unverdicted · novelty 6.0

HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

WavAlign introduces an adaptive hybrid post-training recipe that makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring and regulating their mixture to yield better semantic quality and expressiveness.

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

cs.AI · 2026-05-28 · unverdicted · novelty 4.0

EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.

citing papers explorer

Showing 6 of 6 citing papers.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling cs.LG · 2025-07-02 · unverdicted · none · ref 25
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 40 · 2 links
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings cs.LG · 2026-03-11 · unverdicted · none · ref 8
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance cs.CL · 2026-05-21 · unverdicted · none · ref 46
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training cs.AI · 2026-04-16 · unverdicted · none · ref 3
WavAlign introduces an adaptive hybrid post-training recipe that makes reinforcement learning practical for spoken dialogue models by separating semantic preference updates from acoustic anchoring and regulating their mixture to yield better semantic quality and expressiveness.
Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models cs.AI · 2026-05-28 · unverdicted · none · ref 21
EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.

arXiv preprint arXiv:2505.16984 , year =

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer