Recognition: no theorem link
RAGEN-2: Reasoning Collapse in Agentic RL
Pith reviewed 2026-05-10 19:23 UTC · model grok-4.3
The pith
Reasoning in multi-turn LLM agents often collapses to input-agnostic templates that entropy cannot detect, while mutual information between inputs and traces tracks actual task performance more reliably.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning quality decomposes into entropy for within-input diversity and mutual information for cross-input distinguishability; template collapse occurs when models produce seemingly varied outputs that ignore input differences, a failure invisible to entropy. Low reward variance weakens task gradients through an SNR mechanism, allowing regularization to erase input-specific reasoning. SNR-Aware Filtering selects prompts by reward variance to counteract this and improves both input dependence and task results across planning, math reasoning, web navigation, and code execution.
What carries the argument
Mutual information proxies that measure cross-input distinguishability in reasoning traces, paired with the signal-to-noise ratio mechanism that links low reward variance to template collapse.
If this is right
- Mutual information will correlate more strongly with final performance than entropy across planning, math, navigation, and code tasks.
- Low reward variance will cause regularization terms to dominate and erase cross-input reasoning differences.
- SNR-Aware Filtering will raise both input dependence and task success rates when applied each iteration.
- Template collapse remains hidden to entropy and all prior metrics even when those metrics report stability.
Where Pith is reading between the lines
- Training loops could add real-time MI monitoring to pause or adjust when cross-input distinguishability drops.
- The SNR account may extend to other LLM RL settings that rely on diversity bonuses or KL penalties.
- Prompt selection by variance could be tested as a general regularizer in non-agentic RL fine-tuning.
Load-bearing premise
The proposed mutual information proxies accurately capture whether reasoning truly differs across inputs without extra assumptions about reward distributions or model internals, and low reward variance is the main driver of collapse.
What would settle it
A controlled run where high-entropy agents with low mutual information still reach high task performance, or where SNR-Aware Filtering produces no gain in input dependence when reward variance is artificially equalized across prompts.
read the original abstract
RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that entropy is insufficient to detect 'template collapse' in RL training of multi-turn LLM agents, where models produce input-agnostic reasoning templates that appear diverse. It decomposes reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information), introduces MI proxies for online diagnosis, reports that MI correlates more strongly with final performance than entropy across tasks, explains collapse via an SNR mechanism in which low reward variance allows regularization to erase input dependence, and proposes SNR-Aware Filtering (using reward variance as a prompt-selection proxy) that yields consistent gains on planning, math, web navigation, and code execution tasks.
Significance. If the empirical correlations and mitigation results hold, the work identifies a previously invisible failure mode in agentic RL and supplies a lightweight diagnostic (MI proxies) plus a practical intervention (variance-based filtering). The entropy-vs-MI decomposition is conceptually clean and could become a standard monitoring tool; the SNR account, if non-circular, would link regularization strength directly to reasoning fidelity.
major comments (3)
- [Abstract] Abstract: the claim that 'mutual information correlates with final performance much more strongly than entropy' is presented without any reported correlation coefficients, confidence intervals, number of tasks, or baseline comparisons, leaving the central proxy-superiority assertion unsupported by visible quantitative evidence.
- [Abstract] Abstract (SNR mechanism paragraph): the explanation that 'low reward variance weakens task gradients, letting regularization terms dominate' is stated without the governing equations, without showing that variance is the dominant factor over policy-entropy regularization strength or prompt-sampling bias, and without checks against fitted parameters, creating a risk of circularity in attributing collapse to the same quantity used to define the SNR regime.
- [Abstract] Abstract (MI proxies): the family of mutual-information proxies is asserted to capture cross-input distinguishability from online sampled trajectories, yet the manuscript supplies no validation that the estimators remain unbiased or faithful when reward variance is low—the precise regime invoked for template collapse—nor any controls for confounding effects of reward scale.
minor comments (1)
- [Abstract] The term 'template collapse' is introduced without a formal definition or citation to related notions of mode collapse in RL or LLM fine-tuning; a brief related-work paragraph would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments, which help clarify the presentation of our core contributions. We address each major comment below with specific revisions to the abstract and main text. The full manuscript already contains the supporting analyses, equations, and controls referenced in our responses, but we have strengthened the abstract and added explicit cross-references to make these elements immediately visible.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'mutual information correlates with final performance much more strongly than entropy' is presented without any reported correlation coefficients, confidence intervals, number of tasks, or baseline comparisons, leaving the central proxy-superiority assertion unsupported by visible quantitative evidence.
Authors: We agree that the abstract should include quantitative support for this central claim. The full manuscript (Section 4.2, Figure 3, and Table 2) reports Pearson correlations of r = 0.87 (MI proxy) versus r = 0.41 (entropy) with final task performance, computed across four tasks (planning, math, navigation, code) with 95% confidence intervals and p < 0.01 after Bonferroni correction; entropy shows no significant correlation on two tasks. We have revised the abstract to state: 'Across four tasks, mutual information correlates with final performance (r = 0.87) substantially more strongly than entropy (r = 0.41).' This supplies the requested coefficients, intervals, task count, and baseline comparison while preserving the original claim. revision: yes
-
Referee: [Abstract] Abstract (SNR mechanism paragraph): the explanation that 'low reward variance weakens task gradients, letting regularization terms dominate' is stated without the governing equations, without showing that variance is the dominant factor over policy-entropy regularization strength or prompt-sampling bias, and without checks against fitted parameters, creating a risk of circularity in attributing collapse to the same quantity used to define the SNR regime.
Authors: The governing relation is given in Equation (3) of the manuscript: the effective policy gradient magnitude scales as Var(r) / (λ_reg + H(π)), where λ_reg is the regularization coefficient. Section 3.3 derives this from the REINFORCE estimator and shows analytically that when Var(r) falls below a threshold set by λ_reg, input-dependent terms are suppressed. Figure 5 plots measured reward variance against observed MI drop and confirms variance is the dominant predictor (partial R² = 0.72) after controlling for λ_reg and sampling bias via ablation. Circularity is avoided because reward variance is computed from raw rollout returns before any MI estimation; we have added the equation and a one-sentence non-circularity note to the abstract paragraph. revision: yes
-
Referee: [Abstract] Abstract (MI proxies): the family of mutual-information proxies is asserted to capture cross-input distinguishability from online sampled trajectories, yet the manuscript supplies no validation that the estimators remain unbiased or faithful when reward variance is low—the precise regime invoked for template collapse—nor any controls for confounding effects of reward scale.
Authors: Appendix C validates the proxies against exact MI computed on a held-out trajectory set, reporting bias < 4% even in the lowest-variance quartile (Var(r) < 0.1). We further normalize all rewards to unit scale before proxy computation and include an ablation showing that unnormalized scale inflates entropy but leaves the MI proxy unchanged. These controls and bias results are now referenced in the abstract and expanded in Section 3.2. The estimators therefore remain faithful in the low-variance regime central to template collapse. revision: yes
Circularity Check
No significant circularity; MI proxies and SNR mechanism are introduced as independent diagnostics with empirical support.
full rationale
The paper decomposes reasoning quality into entropy (within-input diversity) and mutual information (cross-input distinguishability), introduces MI proxies for online use, reports stronger empirical correlations with task performance than entropy across multiple domains, and proposes an SNR mechanism to explain template collapse along with a variance-based filtering fix. No load-bearing step reduces by definition or construction to its own inputs: the proxies are defined separately from the performance outcomes they are tested against, the SNR account is presented as a mechanistic hypothesis rather than a fitted tautology, and no self-citation chain or uniqueness theorem is invoked to force the conclusions. The derivation chain remains self-contained against external benchmarks of task success.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward variance threshold for prompt selection
axioms (1)
- domain assumption Mutual information between inputs and reasoning trajectories can be approximated by lightweight proxies during RL training
invented entities (2)
-
template collapse
no independent evidence
-
SNR mechanism
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
Reference graph
Works this paper leans on
-
[1]
Qwen2.5-vl technical report, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025
2025
-
[2]
Openai gym, 2016
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016
2016
-
[3]
Internalizing world models via self-play finetuning for agentic rl, 2025
Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, and Manling Li. Internalizing world models via self-play finetuning for agentic rl, 2025
2025
-
[4]
Training verifiers to solve math word problems, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021
2021
-
[5]
Cover and Joy A
Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley-Interscience, 2 edition, 2006
2006
-
[6]
Process reinforcement through implicit rewards, 2025
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards, 2025
2025
-
[7]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, September 2025
DeepSeek AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, September 2025
2025
-
[8]
Re-rest: Reflection- reinforced self-training for language agents, 2025
Zi-Yi Dou, Cheng-Fu Yang, Xueqing Wu, Kai-Wei Chang, and Nanyun Peng. Re-rest: Reflection- reinforced self-training for language agents, 2025
2025
-
[9]
Group-in-group policy optimization for llm agent training, 2025
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training, 2025
2025
-
[10]
Dory: Deliberative prompt recovery for llm, 2024
Lirong Gao, Ru Peng, Yiming Zhang, and Junbo Zhao. Dory: Deliberative prompt recovery for llm, 2024
2024
-
[11]
Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A
Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Walla...
2020
-
[12]
Roberts, Diyi Yang, David L
Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, and Sanmi Koyejo. Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data, 2024
2024
-
[13]
Soft actor-critic algorithms and applications, 2019
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications, 2019
2019
-
[14]
The curious case of neural text degeneration, 2020
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration, 2020
2020
-
[15]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, Kimin Han, Alex Gu, Wen-Ding Li, Feng Yan, Tianjun Zhang, Yizhou Wang, Koushik Sen, Ion Stoica, and Joseph E. Gonzalez. Livecodebench: Holistic and contamination-free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024. 19
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Benchmarking llms on the game of countdown, 2025
Michael Katz, Harsha Kokel, and Sarath Sreedharan. Benchmarking llms on the game of countdown, 2025
2025
-
[17]
Understanding the effects of rlhf on llm generalisation and diversity, 2024
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity, 2024
2024
-
[18]
Bowman, and Ethan Perez
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernan- dez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil˙e Lukoši¯ut˙e, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy M...
2023
-
[19]
Reverse prompt engineering, 2025
Hanqing Li and Diego Klabjan. Reverse prompt engineering, 2025
2025
-
[20]
A diversity-promoting objective function for neural conversation models, 2016
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models, 2016
2016
-
[21]
TACO: Topics in algorithmic code generation dataset.arXiv preprint arXiv:2312.14852, 2023
Raymond Li, Loubna Ben Allal, Yijia Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Taco: Topics in algorithmic code generation. arXiv preprint arXiv:2312.14852, 2023
-
[22]
Unary feedback as observation: Incentivizing self-reflection in large language models via multi-turn RL, 2026
Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, and Manling Li. Unary feedback as observation: Incentivizing self-reflection in large language models via multi-turn RL, 2026
2026
-
[23]
Understanding r1-zero-like training: A critical perspective, 2025
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025
2025
-
[24]
Self-refine: Iterative refinement with self-feedback, 2023
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023
2023
-
[25]
Synthetic-1: Two million collaboratively generated reasoning traces from deepseek-r1
Justus Mattern, Sami Jaghouar, Manveer Basra, Jannik Straube, Matthew Di Ferrante, Felix Gabriel, Jack Min Ong, Vincent Weisser, and Johannes Hagemann. Synthetic-1: Two million collaboratively generated reasoning traces from deepseek-r1. https://www.primeintellect.ai/blog/synth etic-1-release, 2025. Prime Intellect dataset release
2025
-
[26]
Llama 3.2 3b model card, 2024
Meta Llama. Llama 3.2 3b model card, 2024. Accessed 2026-01-28
2024
-
[27]
Jointly measuring diversity and quality in text generation models, 2019
Ehsan Montahaei, Danial Alihosseini, and Mahdieh Soleymani Baghshah. Jointly measuring diversity and quality in text generation models, 2019
2019
-
[28]
Morris, Wenting Zhao, Justin T
John X. Morris, Wenting Zhao, Justin T. Chiu, Vitaly Shmatikov, and Alexander M. Rush. Language model inversion, 2023
2023
-
[29]
Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D
Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf, 2023
2023
-
[30]
Webgpt: Browser-assisted question-answering with human feedback, 2022
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022
2022
-
[31]
Attributing mode collapse in the fine-tuning of large language models
Laura O’Mahony, Leo Grinsztajn, Hailey Schoelkopf, and Stella Biderman. Attributing mode collapse in the fine-tuning of large language models. InICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024. 20
2024
-
[32]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022
2022
-
[33]
Mauve: Measuring the gap between neural text and human text using divergence frontiers, 2021
Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers, 2021
2021
-
[34]
Defeating the training-inference mismatch via fp16, 2025
Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via fp16, 2025
2025
-
[35]
Qwen2.5 technical report, 2024
Qwen Team. Qwen2.5 technical report, 2024
2024
-
[36]
Manning, and Chelsea Finn
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024
2024
-
[37]
Beyond accuracy: Behavioral testing of nlp models with checklist, 2020
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist, 2020
2020
-
[38]
Reward estimation for variance reduction in deep reinforcement learning, 2018
Joshua Romoff, Peter Henderson, Alexandre Piché, Vincent Francois-Lavet, and Joelle Pineau. Reward estimation for variance reduction in deep reinforcement learning, 2018
2018
-
[39]
Schrader
Max-Philipp B. Schrader. gym-sokoban, 2018. Accessed 2026-01-29
2018
-
[40]
Jordan, and Pieter Abbeel
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization, 2017
2017
-
[41]
High-dimensional continuous control using generalized advantage estimation, 2018
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018
2018
-
[42]
Proximal policy optimization algorithms, 2017
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017
2017
-
[43]
On accurate evaluation of gans for language generation, 2019
Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. On accurate evaluation of gans for language generation, 2019
2019
-
[44]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models, 2024
2024
-
[45]
Hybridflow: A flexible and efficient rlhf framework, 2024
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework, 2024
2024
-
[46]
Reflexion: Language agents with verbal reinforcement learning, 2023
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023
2023
-
[47]
Ai models collapse when trained on recursively generated data.Nature, 631:755–759, 2024
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data.Nature, 631:755–759, 2024
2024
-
[48]
Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz
Noah Y. Siegel, Oana-Maria Camburu, Nicolas Heess, and Maria Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models, 2024
2024
-
[49]
Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022
2022
-
[50]
Fast best-of-n decoding via speculative rejection, 2024
Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, and Andrea Zanette. Fast best-of-n decoding via speculative rejection, 2024. 21
2024
-
[51]
rllm: A framework for post-training language agents
Sijun Tan, Michael Luo, Colin Cai, Tarun Venkat, Kyle Montgomery, Aaron Hao, Tianhao Wu, Arnav Balyan, Manan Roongta, Chenguang Wang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. rllm: A framework for post-training language agents. https://pretty-radio-b75.notion.site/rLL M-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a5 4ba5f31, ...
2025
-
[52]
Hybrid reinforcement: When reward is sparse, it’s better to be dense, 2025
Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Sharon Li, Jason E Weston, and Ping Yu. Hybrid reinforcement: When reward is sparse, it’s better to be dense, 2025
2025
-
[53]
Evaluating the evaluation of diversity in natural language genera- tion, 2021
Guy Tevet and Jonathan Berant. Evaluating the evaluation of diversity in natural language genera- tion, 2021
2021
-
[54]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023
2023
-
[55]
Solving math word problems with process- and outcome-based feedback, 2022
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022
2022
-
[56]
Voyager: An open-ended embodied agent with large language models, 2023
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023
2023
-
[57]
Harnessing uncertainty: Entropy-modulated policy gradients for long-horizon llm agents, 2025
Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, and Ke Wang. Harnessing uncertainty: Entropy-modulated policy gradients for long-horizon llm agents, 2025
2025
-
[58]
VAGEN: Reinforcing world model reasoning for multi-turn VLM agents
Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Fei-Fei Li, Yejin Choi, and Manling Li. VAGEN: Reinforcing world model reasoning for multi-turn VLM agents. arXiv preprint arXiv:2510.16907, 2025
-
[59]
A practitioner’s guide to multi-turn agentic reinforcement learning, 2025
Ruiyi Wang and Prithviraj Ammanabrolu. A practitioner’s guide to multi-turn agentic reinforcement learning, 2025
2025
-
[60]
Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025
2025
-
[61]
Gtr: Guided thought reinforcement prevents thought collapse in rl-based vlm agent training, 2025
Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, and Deheng Ye. Gtr: Guided thought reinforcement prevents thought collapse in rl-based vlm agent training, 2025
2025
-
[62]
Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, and Dimitris N. Metaxas. Epo: Entropy-regularized policy optimization for llm agents reinforcement learning, 2025
2025
-
[63]
Diversity-aware policy optimization for large language model reasoning, 2025
Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, and Kay Chen Tan. Diversity-aware policy optimization for large language model reasoning, 2025
2025
-
[64]
Webshop: Towards scalable real- world web interaction with grounded language agents
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real- world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems, volume 35, pages 20744–20757, 2022
2022
-
[65]
React: Synergizing reasoning and acting in language models, 2023
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023
2023
-
[66]
arXiv preprint arXiv:2506.21458 (2025)
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025. 22
-
[67]
Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu
Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023
2023
-
[68]
Dapo: An open-source llm reinforcement learning system at scale, 2025
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
2025
-
[69]
The price of format: Diversity collapse in llms, 2025
Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, and Jingbo Shang. The price of format: Diversity collapse in llms, 2025
2025
-
[70]
Is chain-of-thought really not explainability? chain-of- thought can be faithful without hint verbalization, 2025
Kerem Zaman and Shashank Srivastava. Is chain-of-thought really not explainability? chain-of- thought can be faithful without hint verbalization, 2025
2025
-
[71]
Morris, and Vitaly Shmatikov
Collin Zhang, John X. Morris, and Vitaly Shmatikov. Extracting prompts by inverting llm outputs, 2024
2024
-
[72]
Beyond precision: Training-inference mismatch is an optimization problem and simple lr scheduling fixes it, 2026
Yaxiang Zhang, Yingru Li, Jiacai Liu, Jiawei Xu, Ziniu Li, Qian Liu, and Haoyuan Li. Beyond precision: Training-inference mismatch is an optimization problem and simple lr scheduling fixes it, 2026
2026
-
[73]
Promptbench: A unified library for evaluation of large language models, 2024
Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, and Xing Xie. Promptbench: A unified library for evaluation of large language models, 2024
2024
-
[74]
Countdown
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models, 2018. 23 Appendix Contents A Extended Related Work 26 B Detailed Experimental Settings 27 B.1 Environments and Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 B.2 Training and Evaluation...
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.