hub Canonical reference

Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions

Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions , author= · 2025 · arXiv 2506.07527

Canonical reference. 100% of citing Pith papers cite this work as background.

20 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 20 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

cs.RO · 2026-03-14 · unverdicted · novelty 7.0

PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

cs.LG · 2025-07-02 · unverdicted · novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

cs.LG · 2026-05-01 · conditional · novelty 6.0

DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

cs.LG · 2026-04-14 · unverdicted · novelty 6.0 · 2 refs

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

cs.LG · 2026-03-11 · unverdicted · novelty 6.0

HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

cs.CV · 2025-10-12 · unverdicted · novelty 6.0

ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.

EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

cs.LG · 2025-08-11 · unverdicted · novelty 6.0

EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.

Sample-efficient LLM Optimization with Reset Replay

cs.LG · 2025-08-08 · unverdicted · novelty 5.0

LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.

BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards

cs.AI · 2026-06-27 · unverdicted · novelty 4.0

BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

cs.LG · 2026-05-06 · unverdicted · novelty 4.0

Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

cs.CL · 2026-01-20

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

cs.LG · 2025-09-26

citing papers explorer

Showing 20 of 20 citing papers.

Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 37
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Near-Future Policy Optimization cs.LG · 2026-04-22 · unverdicted · none · ref 16
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving cs.RO · 2026-03-14 · unverdicted · none · ref 44
PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling cs.LG · 2025-07-02 · unverdicted · none · ref 29
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution cs.CV · 2026-05-25 · unverdicted · none · ref 22
AnE combines Truth Anchor Expansion and Scaffold-Stripping to deliver 10.3% gains on eight multimodal reasoning benchmarks for MLLMs.
Selective Off-Policy Reference Tuning with Plan Guidance cs.AI · 2026-05-12 · unverdicted · none · ref 4 · 2 links
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
AIPO: Learning to Reason from Active Interaction cs.CL · 2026-05-08 · unverdicted · none · ref 46 · 2 links
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors cs.LG · 2026-05-01 · conditional · none · ref 14
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
Generalization in LLM Problem Solving: The Case of the Shortest Path cs.AI · 2026-04-16 · unverdicted · none · ref 33
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 37 · 2 links
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings cs.LG · 2026-03-11 · unverdicted · none · ref 10
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
SPHINX: A Synthetic Environment for Visual Perception and Reasoning cs.CV · 2025-11-25 · unverdicted · none · ref 31
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models cs.CV · 2025-10-12 · unverdicted · none · ref 25
ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning cs.LG · 2025-08-11 · unverdicted · none · ref 10
EvoCoT uses self-generated and verified CoT trajectories in a two-stage curriculum to let LLMs learn from initially unsolved hard problems in RLVR settings.
Sample-efficient LLM Optimization with Reset Replay cs.LG · 2025-08-08 · unverdicted · none · ref 13
LoRR augments preference optimization methods like DPO with high-replay training, periodic resets to initial data/policy, and a hybrid objective to improve sample efficiency and reduce primacy bias on math and reasoning tasks.
BV-Blend: Uncertainty-Weighted Historical Baselines for Stable Critic-Free RL with Verifiable Rewards cs.AI · 2026-06-27 · unverdicted · none · ref 68
BV-Blend blends prompt-local and semantic-cluster historical reward statistics via SEM-derived weights to stabilize critic-free RL advantage estimation.
Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization cs.LG · 2026-05-06 · unverdicted · none · ref 12
Outcome-level RL with binary or composite rewards improves compositional generalization over supervised fine-tuning by avoiding overfitting to frequent training patterns.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 25
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment cs.CL · 2026-01-20 · unreviewed · ref 26
Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards cs.LG · 2025-09-26 · unreviewed · ref 19

Learning what reinforcement learning can't: Interleaved online fine-tuning for hardest questions

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer