SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.
Information gain-based policy optimization: A simple and effective approach for multi-turn llm agents
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10roles
background 2polarities
background 2representative citing papers
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
LLMs trained with self-play negotiation achieve collective agency alignment comparable to single-agent baselines while substantially improving conflict-resolution performance.
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
The paper introduces an Information Gain Reward to train clarification behavior in LLM agents, reporting a 3.7% success rate gain over no-clarification baselines in τ-Bench evaluations across five models with minimal added steps.
A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
citing papers explorer
-
Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards
SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs
LLMs trained with self-play negotiation achieve collective agency alignment comparable to single-agent baselines while substantially improving conflict-resolution performance.
-
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
-
Uncertainty-Aware Clarification in LLM Agents with Information Gain
The paper introduces an Information Gain Reward to train clarification behavior in LLM agents, reporting a 3.7% success rate gain over no-clarification baselines in τ-Bench evaluations across five models with minimal added steps.
-
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models
A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.
-
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
- OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search