arXiv preprint arXiv:2506.19767

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning · 2025 · arXiv 2506.19767

Canonical reference. 100% of citing Pith papers cite this work as background.

12 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

cs.LG · 2026-05-01 · conditional · novelty 6.0

DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

cs.LG · 2026-03-11 · unverdicted · novelty 6.0

HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

cs.LG · 2025-08-11 · unverdicted · novelty 6.0

EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

cs.LG · 2025-12-12 · unverdicted · novelty 5.0

Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

cs.LG · 2025-08-19 · unverdicted · novelty 5.0

DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

cs.LG · 2026-04-15

citing papers explorer

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · none · ref 5

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

cs.LG · 2026-05-01 · conditional · none · ref 27

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

cs.LG · 2026-03-11 · unverdicted · none · ref 3

EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

cs.LG · 2025-08-11 · unverdicted · none · ref 7

Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

cs.LG · 2025-12-12 · unverdicted · none · ref 9

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

cs.LG · 2025-08-19 · unverdicted · none · ref 2

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

cs.LG · 2026-04-15 · unreviewed · ref 5
