Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Information Gain-based Policy Optimization: A Simple, Effective Approach for Multi-Turn LLM Agents · 2025 · arXiv 2510.14967

10 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

SC-GRPO improves RL with verifiable rewards by multiplying GRPO gradients with self-induced per-token KL divergence, outperforming GRPO by 8.1% and DAPO by 5.9% on math, code, and agent benchmarks.

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

cs.AI · 2026-04-08 · unverdicted · novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

cs.CL · 2026-03-11 · unverdicted · novelty 7.0

LLMs trained with self-play negotiation achieve collective agency alignment comparable to single-agent baselines while substantially improving conflict-resolution performance.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks, including 96.7% and 99.2% on AIME 2025 at the 4B and 30B scales, by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

Uncertainty-Aware Clarification in LLM Agents with Information Gain

cs.AI · 2026-06-02 · unverdicted · novelty 5.0

The paper introduces an Information Gain Reward to train clarification behavior in LLM agents, reporting a 3.7% success rate gain over no-clarification baselines in τ-Bench evaluations across five models with minimal added steps.

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

cs.AI · 2026-04-10 · unverdicted · novelty 5.0

E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

cs.AI · 2026-04-04

citing papers explorer

Showing 10 of 10 citing papers.

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-06-17 · unverdicted · none · ref 18
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent cs.AI · 2026-04-08 · unverdicted · none · ref 18
Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs cs.CL · 2026-03-11 · unverdicted · none · ref 12
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization cs.AI · 2026-05-18 · unverdicted · none · ref 17
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation cs.LG · 2026-05-12 · unverdicted · none · ref 15 · 2 links
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 37
Uncertainty-Aware Clarification in LLM Agents with Information Gain cs.AI · 2026-06-02 · unverdicted · none · ref 11
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models cs.CL · 2026-04-10 · unverdicted · none · ref 27
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning cs.AI · 2026-04-10 · unverdicted · none · ref 27
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search cs.AI · 2026-04-04 · unreviewed · ref 29
