ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
hub Canonical reference
arXiv preprint arXiv:2508.11408 , year=
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5polarities
background 5representative citing papers
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.
EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.
citing papers explorer
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
TiCo: Time-Controllable Spoken Dialogue Model
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
-
AIPO: Learning to Reason from Active Interaction
AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
-
Visual Reasoning through Tool-supervised Reinforcement Learning
ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
-
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
-
Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
Dynamic-TreeRPO replaces independent trajectory sampling with a tree-structured search using dynamic noise intensities and integrates SFT into RL via a weighted Progress Reward Model to achieve better semantic consistency and efficiency in text-to-image generation.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
-
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
-
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.
-
Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
EKSFT masks high-entropy or high-KL tokens in low-data SFT to preserve pre-trained distribution and improve downstream RL performance on math reasoning tasks.
- Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning