BREAD: Branched Rollouts from Expert Anchors Bridge SFT & RL for Reasoning

Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, Samet Oymak · 2025 · arXiv 2506.17211

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

cs.LG · 2026-01-28 · unverdicted · novelty 7.0

Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

cs.AI · 2025-12-21 · unverdicted · novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

cs.CL · 2026-01-20

citing papers explorer

Showing 6 of 6 citing papers.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning · cs.LG · 2026-04-08 · unverdicted · none · ref 176
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning · cs.LG · 2026-01-28 · unverdicted · none · ref 24
CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning · cs.AI · 2025-12-21 · unverdicted · none · ref 23
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR · cs.AI · 2026-05-15 · unverdicted · none · ref 28
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning · cs.AI · 2026-05-13 · unverdicted · none · ref 45
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment · cs.CL · 2026-01-20 · unreviewed · ref 51