pith. machine review for the scientific record.

arxiv: 2605.06200 · v1 · submitted 2026-05-07 · 💻 cs.CL

Recognition: unknown

A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Chengming Li, Dingwei Chen, Jie Jiang, Leo Luo, Peng Chen, Yang Li, Zefang Zong, Zhipeng Ma

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords reinforcement learning · large language models · agentic systems · information gain · policy optimization · credit assignment · multi-turn interactions · adaptive clipping

The pith

Redesigning how information gain is normalized, accumulated, and clipped improves credit assignment for multi-turn LLM agents without external reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to solve the problem of sparse trajectory-level rewards in reinforcement learning for agentic large language models, where it is hard to tell which individual tool calls or turns actually helped reach a correct outcome. It keeps the per-turn information gain signal but fixes three issues: normalizing each turn only against others at the same depth, rescaling cumulative advantages by the square root of the number of terms to stop magnitude drift, and making the clipping range larger for turns with strong signals and smaller for weak ones. A sympathetic reader would care because these changes let training use only the model's own predictions as feedback, avoiding the cost and bias of separate process reward models while still allowing diverse trajectories. If the changes work, policy updates become more stable and focused on genuinely informative steps across varying interaction lengths.

Core claim

A²TGPO retains information gain as the intrinsic process signal but applies three redesigns: turn-group normalization that compares each turn only to peers sharing the same prompt and turn index, variance-rescaled discounted accumulation that divides the cumulative value by the square root of the number of accumulated terms, and adaptive turn-level clipping that widens the allowable policy update range for turns with higher normalized information gain and narrows it for lower ones.
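
A minimal sketch of the first two redesigns, assuming raw per-turn IG values are already available and that accumulation runs as a discounted sum over subsequent turns; the grouping keys, discount factor, and variable names here are illustrative, not taken from the paper.

```python
import numpy as np

def turn_group_normalize(ig, prompt_ids, turn_ids, eps=1e-8):
    """Normalize raw IG within each (prompt, turn-index) group so that every
    turn is compared only against peers at the same interaction depth."""
    ig = np.asarray(ig, dtype=float)
    normed = np.empty_like(ig)
    for key in set(zip(prompt_ids, turn_ids)):
        mask = np.array([pt == key for pt in zip(prompt_ids, turn_ids)])
        group = ig[mask]
        normed[mask] = (group - group.mean()) / (group.std() + eps)
    return normed

def rescaled_discounted_advantage(norm_ig, gamma=0.95):
    """Discounted sum of future normalized IG divided by sqrt(#terms):
    if the normalized terms are roughly unit-variance and uncorrelated,
    a T-term sum has variance ~T, so the square root keeps advantage
    magnitudes comparable across trajectories of different depth."""
    T = len(norm_ig)
    adv = np.zeros(T)
    for t in range(T):
        terms = [gamma ** (k - t) * norm_ig[k] for k in range(t, T)]
        adv[t] = sum(terms) / np.sqrt(len(terms))
    return adv
```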

What carries the argument

Turn-group normalization of information gain combined with variance-rescaled accumulation and adaptive clipping that modulates the PPO-style clipping range per turn based on its normalized signal strength.
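
A rough sketch of how the adaptive clip range could enter a PPO-style per-turn loss; the sigmoid mapping to a clip scale mirrors the description of c_{i,t} in Figure 2 and the β coefficient in Figure 9, but the exact functional form and default values are assumptions, not the paper's.

```python
import torch

def adaptive_turn_clip_loss(log_ratio, advantage, norm_ig, eps0=0.2, beta=0.3):
    """Clipped surrogate where the clip range widens for turns with high
    normalized IG and narrows for low ones (illustrative form only)."""
    ratio = log_ratio.exp()
    # map normalized IG through a sigmoid to a clip scale in [1 - beta, 1 + beta]
    clip_scale = 1.0 + beta * (2.0 * torch.sigmoid(norm_ig) - 1.0)
    eps = eps0 * clip_scale
    unclipped = ratio * advantage
    clipped_ratio = torch.maximum(torch.minimum(ratio, 1.0 + eps), 1.0 - eps)
    return -torch.minimum(unclipped, clipped_ratio * advantage).mean()
```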

If this is right

  • Turns at the same interaction depth are ranked fairly without distortion from different positional contexts.
  • Advantage magnitudes remain comparable even when trajectories have very different numbers of turns.
  • Policy gradient steps are larger for informative turns and smaller for uninformative ones.
  • Training requires no separate external process reward model and preserves full trajectory diversity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same grouping and rescaling steps could be applied to other intrinsic signals such as entropy or surprise in non-LLM settings.
  • Hybrid use with sparse outcome rewards might further stabilize training on very long-horizon agent tasks.
  • Testing on models of increasing size would reveal whether the adaptive clipping range needs to be tuned to model capacity.

Load-bearing premise

Per-turn information gain still accurately reflects each turn's true contribution to success even after group normalization and adaptive clipping are applied across heterogeneous trajectories.

What would settle it

An experiment that trains identical models with standard information gain versus A²TGPO on a fixed multi-turn tool-use benchmark and then measures whether the new method produces higher final success rates and better alignment between high-IG turns and human-labeled useful actions.
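
One way to operationalize the alignment half of that test is a simple precision-at-k check between the highest-IG turns and human usefulness labels; this is an editorial sketch, not a protocol from the paper.

```python
def ig_label_alignment(ig_scores, useful_labels, k=3):
    """Fraction of the top-k highest-IG turns in a trajectory that humans
    labeled as useful; averaged over trajectories, a rough alignment score
    between the intrinsic signal and human judgment."""
    ranked = sorted(range(len(ig_scores)), key=lambda i: ig_scores[i], reverse=True)
    top_k = ranked[:min(k, len(ranked))]
    return sum(useful_labels[i] for i in top_k) / max(len(top_k), 1)

# toy example: four turns, humans marked turns 0 and 2 as useful
print(ig_label_alignment([0.4, -0.1, 0.7, 0.05], [1, 0, 1, 0], k=2))  # -> 1.0
```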

Figures

Figures reproduced from arXiv: 2605.06200 by Chengming Li, Dingwei Chen, Jie Jiang, Leo Luo, Peng Chen, Yang Li, Zefang Zong, Zhipeng Ma.

Figure 1
Figure 1. Left: Per-turn intra-position context similarity between rollouts of the same prompt. Right: Overall intra-position vs. cross-position similarity. Rollouts at the same turn share substantially more similar contexts than those at different turns. view at source ↗
Figure 2
Figure 2. The framework of A²TGPO. Raw IG signals are first normalized within each turn group, then flow into discounted accumulation with variance rescaling to produce the turn-level advantage Â_{i,t}, while a sigmoid mapping yields the adaptive clip scale c_{i,t}. Both are consumed by the turn-level clipped policy loss. Grouping by (q, t) reflects the empirical observation in agentic settings that trajectories sharing… view at source ↗
Figure 3
Figure 3. Left: Entropy comparison during training on the multi-hop benchmark. Right: Performance comparison between classic baselines on the HotpotQA dataset. Both are based on Qwen3-4B. view at source ↗
Figure 4
Figure 4. Within-step per-turn advantage distribution on multi-hop benchmarks based on Qwen3-… view at source ↗
Figure 5
Figure 5. Advantage envelope dynamics over 240 training steps on multi-hop benchmarks based on… view at source ↗
Figure 6
Figure 6. The prompt template in our experiment setting. view at source ↗
Figure 7
Figure 7. Left: Per-step training time on Qwen3-4B multi-hop QA under rollout budget n = 16. Right: Average per-step time breakdown over 240 training steps. The IG forward pass is A²TGPO's sole additional component (+164 s), whose cost is largely offset by faster generation (−86 s), resulting in a net overhead of only +15 s (+2.9%). view at source ↗
Figure 8
Figure 8. Response length statistics (min, mean, max) over 240 training steps. view at source ↗
Figure 9
Figure 9. Sensitivity of A²TGPO to the adaptive clipping coefficient β (Eq. (11)). β = 0 reduces to a fixed clipping range. Both benchmarks exhibit a clear trend peaking at β = 0.3, and performance remains stable across β ∈ [0.2, 0.4]. view at source ↗
Figure 10
Figure 10. Distribution of the number of tool calls per rollout on multi-hop and single-hop benchmarks. view at source ↗
read the original abstract

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A²TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
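
A minimal sketch of the IG signal as the abstract defines it, the per-turn change in the policy's predicted probability of the ground truth; `prob_of_answer` is a placeholder for whatever scoring pass the authors actually use, and the interface here is an assumption.

```python
def per_turn_information_gain(prob_of_answer, turns):
    """prob_of_answer(context) -> float: assumed to run the current policy
    and return its predicted probability of the ground-truth answer given
    the interaction so far. IG for a turn is the change in that probability."""
    ig, context = [], []
    prev = prob_of_answer(context)
    for turn in turns:
        context = context + [turn]
        cur = prob_of_answer(context)
        ig.append(cur - prev)  # positive when the turn made the answer more likely
        prev = cur
    return ig
```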

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard IG-based intrinsic rewards for multi-turn agentic LLM RL suffer from three issues—heterogeneous positional contexts distorting normalization, variable accumulation causing advantage drift with depth, and fixed clipping applying uniformly across turns with different IG values—and proposes A²TGPO to fix them while retaining IG as the signal. The fixes are (i) turn-group normalization of IG within each (prompt, turn-index) group, (ii) variance-rescaled discounted accumulation that divides the cumulative normalized IG by the square root of the number of accumulated terms, and (iii) adaptive turn-level clipping that widens the PPO clip range for high normalized-IG turns and narrows it for low-IG turns.

Significance. If the proposed normalizations and adaptive clipping can be shown to preserve unbiased advantages and the monotonic-improvement property of the clipped surrogate while improving credit assignment, the method would offer a lightweight, external-model-free alternative to process reward models or tree rollouts for training agentic LLMs. The retention of the existing IG signal and the focus on per-turn heterogeneity are practical strengths that could translate to better sample efficiency on multi-turn tool-use benchmarks.

major comments (3)
  1. [Abstract] The adaptive turn-level clipping modulates the PPO clip bounds directly with the normalized IG value itself. This makes the trust-region radius signal-dependent, which can correlate update magnitude with the very quantity being optimized and risks violating the monotonic improvement guarantee that the fixed-clip surrogate is designed to enforce; no derivation or counter-example analysis is supplied to show the modified surrogate remains a valid lower bound.
  2. [Abstract] Turn-group normalization and variance-rescaled discounted accumulation are presented as remedies for positional heterogeneity and depth-dependent drift, yet the description supplies no proof or empirical check that dividing by sqrt(accumulated terms) restores unbiased advantage estimates when per-group sample sizes are modest or when IG variance changes with turn depth; these steps remain heuristic.
  3. [Abstract] The central claim that the three redesigns solve the stated challenges is unsupported by any equations, bounds, or experimental results in the summary. The manuscript must include at minimum ablations isolating each component and comparisons against standard PPO-IG and process-reward baselines on agentic benchmarks to substantiate the improvements.
minor comments (2)
  1. [Abstract] The title expands A²TGPO but the abstract does not spell out the second 'A' (presumably 'Adaptive' or 'Agentic'); explicit expansion on first use would improve readability.
  2. [Abstract] No reference is made to the specific RL algorithm (PPO variant), the exact form of the advantage estimator, or the datasets/benchmarks used for validation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful and constructive feedback on our manuscript. The comments highlight important theoretical and empirical aspects that we will address to strengthen the paper. We provide point-by-point responses below and commit to revisions that incorporate additional analysis and experiments.

read point-by-point responses
  1. Referee: [Abstract] The adaptive turn-level clipping modulates the PPO clip bounds directly with the normalized IG value itself. This makes the trust-region radius signal-dependent, which can correlate update magnitude with the very quantity being optimized and risks violating the monotonic improvement guarantee that the fixed-clip surrogate is designed to enforce; no derivation or counter-example analysis is supplied to show the modified surrogate remains a valid lower bound.

    Authors: We appreciate this observation about the potential impact on the trust-region property. The adaptive clipping is intended to allocate larger updates to high-information turns while constraining low-IG turns for stability, with the normalized IG serving as a per-turn importance weight. In the revised manuscript we will add a formal analysis in the appendix deriving conditions under which the adaptive-clip surrogate remains a valid lower bound on the expected improvement, along with a small-scale counter-example study on synthetic trajectories to verify the property holds in practice. If the analysis identifies edge cases, we will adjust the modulation function accordingly. revision: yes

  2. Referee: [Abstract] Turn-group normalization and variance-rescaled discounted accumulation are presented as remedies for positional heterogeneity and depth-dependent drift, yet the description supplies no proof or empirical check that dividing by sqrt(accumulated terms) restores unbiased advantage estimates when per-group sample sizes are modest or when IG variance changes with turn depth; these steps remain heuristic.

    Authors: We agree that these normalization steps are primarily motivated by the observed issues of positional bias and depth-dependent magnitude drift rather than a strict unbiasedness proof. Turn-group normalization compares each turn only to same-depth peers, and the square-root rescaling is chosen to counteract the growth in the variance of the summed terms: for roughly uncorrelated, unit-variance terms, the variance of a T-term sum scales with T, so dividing by sqrt(T) restores a comparable scale. In the revision we will include an empirical section with plots of advantage statistics across turn depths before and after rescaling, plus ablation results on varying group sizes and IG variance regimes to demonstrate that advantage magnitudes remain stable and credit assignment improves. We will also clarify the heuristic nature while showing practical benefits on the evaluated benchmarks. revision: yes

  3. Referee: [Abstract] The central claim that the three redesigns solve the stated challenges is unsupported by any equations, bounds, or experimental results in the summary. The manuscript must include at minimum ablations isolating each component and comparisons against standard PPO-IG and process-reward baselines on agentic benchmarks to substantiate the improvements.

    Authors: The abstract summarizes the method; the full manuscript already reports results on multi-turn agentic benchmarks. To directly address the request, we will expand the experiments section with (i) component-wise ablations isolating turn-group normalization, variance-rescaled accumulation, and adaptive clipping, and (ii) head-to-head comparisons against vanilla PPO-IG and process-reward-model baselines, reporting metrics such as success rate, sample efficiency, and advantage stability. These additions will be supported by the corresponding equations and implementation details already present in the method section. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic redesign of IG normalization and clipping is self-contained

full rationale

The paper presents A²TGPO as a set of explicit design choices—turn-group normalization of IG, variance-rescaled discounted accumulation, and adaptive turn-level clipping—to address three stated challenges with prior IG usage. These modifications are described directly in the abstract and introduction as heuristic reparameterizations of an existing intrinsic signal rather than as quantities derived from or fitted to the same data the method is applied to. No equations reduce the claimed improvements to self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations that invoke uniqueness theorems from the authors' prior work. The derivation chain consists of problem identification followed by proposed fixes whose validity is left to empirical validation, with no reduction of the central result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that information gain computed from the policy's own probability of the ground-truth is a useful intrinsic process signal; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Information gain from the policy's predicted probability of the ground-truth serves as a valid intrinsic process reward without external models
    The entire redesign is motivated by and built upon this premise stated in the abstract.

pith-pipeline@v0.9.0 · 5624 in / 1331 out tokens · 56610 ms · 2026-05-08T10:36:48.098765+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 26 canonical work pages · 11 internal anchors

  1. [1]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

  2. [2]

    M²IV: Towards efficient and fine-grained multimodal in-context learning via representation engineering

    Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, and Ruixiang Tang. M²IV: Towards efficient and fine-grained multimodal in-context learning via representation engineering. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id= 9ffYcEiNw9

  3. [3]

    Make lvlms focus: Context- aware attention modulation for better multimodal in-context learning, 2025

    Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, and Ruixiang Tang. Make lvlms focus: Context- aware attention modulation for better multimodal in-context learning, 2025. URL https://arxiv.org/ abs/2505.17097

  4. [4]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

  5. [5]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  6. [6]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  7. [7]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  8. [8]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URLhttps://arxiv.org/abs/2...

  9. [9]

    Agentic Reinforced Policy Optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization, 2025. URL https://arxiv.org/abs/2507.19849

  10. [10]

    Agentic Entropy-Balanced Policy Optimization

    Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic entropy-balanced policy optimization, 2025. URLhttps://arxiv.org/abs/2510.14545

  11. [11]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  12. [12]

    AT²PO: Agentic Turn-Based Policy Optimization via Tree Search

    Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, and Jie Jiang. At2po: Agentic turn-based policy optimization via tree search, 2026. URL https://arxiv.org/abs/ 2601.04767

  13. [13]

    CARL: Criticality-Aware Agentic Reinforcement Learning

    Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, and Tat-Seng Chua. Carl: Critical action focused reinforcement learning for multi-step agent, 2025. URLhttps://arxiv.org/abs/2512.04949

  14. [14]

    Webdancer: Towards autonomous information seeking agency

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding- Chu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/f...

  15. [15]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

  16. [16]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  17. [17]

    TreeRL: LLM Reinforcement Learning with On-Policy Tree Search

    Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. Treerl: Llm reinforcement learning with on-policy tree search, 2025. URLhttps://arxiv.org/abs/2506.11902

  18. [18]

    Tree Search for LLM Agent Reinforcement Learning

    Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for llm agent reinforcement learning, 2025. URLhttps://arxiv.org/abs/2509.21240

  19. [19]

    TreeRPO: Tree Relative Policy Optimization

    Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. Treerpo: Tree relative policy optimization, 2025. URLhttps://arxiv.org/abs/2506.05183

  20. [20]

    Information gain-based policy optimization: A simple and effective approach for multi-turn search agents

    Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, and Zhenzhe Ying. Information gain-based policy optimization: A simple and effective approach for multi-turn search agents. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qkWP6phrvZ

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  22. [22]

    Deep Reinforcement Learning from Human Preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017

  23. [23]

    Learning to Summarize with Human Feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  25. [25]

    Rewarding progress: Scaling automated process verifiers for LLM reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=A6Y7AqlzLW

  26. [26]

    Process Reward Models for LLM Agents: Practical Framework and Directions

    Sanjiban Choudhury. Process reward models for llm agents: Practical framework and directions.arXiv preprint arXiv:2502.10325, 2025

  27. [27]

    Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

    Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning. arXiv preprint arXiv:2405.00451, 2024

  28. [28]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processin...

  29. [29]

    Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Nuria Bel, and Chengqing Zong, editors,Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International ...

  30. [30]

    MuSiQue: Multihop Questions via Single-hop Question Composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022. doi: 10.1162/tacl_a_00475. URL https://aclanthology.org/2022. tacl-1.31/

  31. [31]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models, 2023. URL https://arxiv.org/abs/2210.03350

  32. [32]

    Natural Questions: A Benchmark for Question Answering Research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...

  33. [33]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. ...

  34. [34]

    When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-parametric Memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories.arXiv preprint, 2022

  35. [35]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  36. [36]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  37. [37]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  38. [38]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024