pith. machine review for the scientific record.

arxiv: 2605.11775 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links


Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Aiden Adams, Chenxin An, Dingwei Zhu, Fei Huang, Han Li, Jiazheng Zhang, Junrui Shen, Long Ma, Qi Zhang, Shaofan Liu, Shichun Liu, Shihan Dou, Tao Gui, Wiggin Zhou, Xuanjing Huang, Yunbin Zhao, Yunke Zhang, Zhihao Zhang, Zhiheng Xi, Ziche Fu

Pith reviewed 2026-05-13 07:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords entropy polarity · policy entropy · reinforcement learning · LLM fine-tuning · RLVR · policy optimization · exploration control

The pith

A first-order approximation of entropy change produces entropy polarity, a signed token-level quantity that predicts whether a sampled policy update expands or contracts entropy in LLM reinforcement fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework for how policy updates alter entropy at the token level during reinforcement learning with verifiable rewards. It derives entropy polarity as a signed quantity from a first-order approximation, showing that the sign indicates expansion or contraction and exposing an asymmetry where high-probability tokens favor contraction while expansion requires lower-probability samples. This token-level view enables a method that preserves both polarity branches and reweights advantages using the observed entropy trajectory as a signal. This matters because direct, local control could replace global entropy objectives, balancing exploration and exploitation more precisely during training.
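For orientation, here is the standard first-order relation behind such a quantity, written for a single token position with next-token distribution π; the notation is illustrative and the paper's exact polarity definition may differ.

```latex
% Entropy of one next-token distribution (illustrative notation, not the paper's):
H(\pi) \;=\; -\sum_{v} \pi(v)\,\log \pi(v)

% First-order change under a small shift \Delta\pi with \sum_v \Delta\pi(v) = 0.
% The "+1" from d(-p log p)/dp = -(\log p + 1) cancels, and because any constant
% can be subtracted inside the sum, the change can be measured against the mean
% log-probability -H(\pi):
\Delta H \;\approx\; -\sum_{v} \Delta\pi(v)\,\log \pi(v)
         \;=\; -\sum_{v} \Delta\pi(v)\,\bigl(\log \pi(v) + H(\pi)\bigr)
```

Read this way, the asymmetry in the core claim is already visible at first order: an update that moves probability mass onto tokens that are more likely than average (log π(v) > −H(π)) contracts entropy, while mass moved onto below-average tokens expands it.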

Core claim

Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, entropy polarity reliably predicts entropy changes, and positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, Polarity-Aware Policy Optimization (PAPO) preserves both polarity branches and implements entropy control through advantage reweighting.

What carries the argument

entropy polarity, a signed token-level quantity obtained from the first-order approximation of entropy change

Load-bearing premise

The first-order approximation of entropy change accurately captures the token-level mechanics and structural asymmetry without higher-order effects dominating.

What would settle it

Collect token-level entropy changes on a held-out set of updates and check whether the measured sign and magnitude match the predicted polarity values within a small error bound; systematic mismatches at moderate step sizes would falsify the approximation.
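A minimal sketch of that check, assuming access to per-position token distributions before and after an update; the tensor names, shapes, and the simulated update are illustrative, not from the paper.

```python
import torch

def entropy(logits):
    # Shannon entropy (nats) of each categorical distribution along the last axis.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def first_order_dH(logits_before, logits_after):
    # Predicted entropy change from the first-order term, evaluated at the old
    # policy: dH ≈ -sum_v dpi(v) * log pi_old(v).
    p_old = torch.softmax(logits_before, dim=-1)
    p_new = torch.softmax(logits_after, dim=-1)
    dpi = p_new - p_old
    return -(dpi * torch.log(p_old.clamp_min(1e-12))).sum(dim=-1)

def polarity_vs_exact(logits_before, logits_after):
    # Exact finite-difference entropy change vs. the first-order prediction:
    # sign agreement tests the polarity claim, absolute error its magnitude.
    pred = first_order_dH(logits_before, logits_after)
    exact = entropy(logits_after) - entropy(logits_before)
    sign_agreement = (pred.sign() == exact.sign()).float().mean().item()
    mean_abs_error = (pred - exact).abs().mean().item()
    return sign_agreement, mean_abs_error

# Toy stand-in for a held-out batch of per-token logits before/after one update.
torch.manual_seed(0)
before = torch.randn(256, 4096)
after = before + 0.05 * torch.randn_like(before)  # small simulated update
print(polarity_vs_exact(before, after))           # signs should mostly agree at small steps
```

Systematic sign disagreement at the step sizes actually used in training, rather than only at artificially large ones, is what would falsify the approximation.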

read the original abstract

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
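The abstract names PAPO but gives no formula for its reweighting. Purely as a sketch of where the pieces could fit, the snippet below combines a crude polarity proxy with a clipped-linear phase gate driven by the entropy trajectory; the proxy, the gate, the target entropy, and every identifier are assumptions, not the paper's Algorithm 1.

```python
import torch

def polarity_proxy(logp_sampled, position_entropy, advantage):
    # Crude sign proxy for token-level polarity: with positive advantage,
    # reinforcing a token whose log-probability sits above the distribution's
    # mean log-prob (-H) tends to contract entropy (negative value), while a
    # below-average token tends to expand it (positive value). Not the paper's
    # exact quantity.
    return -advantage * (logp_sampled + position_entropy)

def reweight_advantages(advantages, logp_sampled, position_entropy,
                        running_entropy, target_entropy, gain=1.0):
    # Keep both polarity branches but shift optimization pressure using the
    # empirical entropy trajectory as a phase signal: when running entropy is
    # below target, upweight entropy-expanding tokens; above target, upweight
    # entropy-contracting ones. The clipped-linear gate is an illustrative choice.
    pol = polarity_proxy(logp_sampled, position_entropy, advantages)
    phase = max(-1.0, min(1.0, gain * (target_entropy - running_entropy)))
    weights = 1.0 + phase * pol.sign()
    return advantages * weights

# Toy usage on per-token quantities from one rollout batch.
torch.manual_seed(0)
adv = torch.randn(1024)              # per-token advantages
logp = -5.0 * torch.rand(1024)       # log-probs of the sampled tokens
H = 3.0 * torch.rand(1024)           # per-position policy entropies
new_adv = reweight_advantages(adv, logp, H, running_entropy=1.2, target_entropy=1.5)
```

The gate only shows where an online phase signal could enter; how the paper actually allocates pressure between branches is not recoverable from the abstract alone.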

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a theoretical framework for entropy mechanics in RLVR for LLMs. It derives a first-order approximation of token-level entropy change that yields entropy polarity, a signed quantity predicting whether a sampled update expands or contracts entropy. The analysis identifies a structural asymmetry (frequent high-probability tokens contract entropy; lower-probability samples expand it) and shows that positive and negative polarity branches play complementary roles. Building on this, the authors propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches via advantage reweighting and adaptively reallocates optimization pressure using the empirical entropy trajectory as a phase signal. Experiments on mathematical reasoning and agentic benchmarks report that PAPO outperforms competitive baselines with improved training efficiency and reward gains.

Significance. If the first-order approximation holds and the empirical predictions are robust, the work would provide a token-level mechanism for entropy control that is more granular than global regularization approaches. PAPO's adaptive reweighting and the reported gains on reasoning benchmarks could influence how exploration is managed in LLM fine-tuning pipelines.

major comments (3)
  1. [§3.2, Eq. (4)] The first-order Taylor expansion ΔH ≈ ∑_t (∂H/∂π_t) · Δπ_t is introduced without an explicit bound on the remainder involving second derivatives of -p log p and cross-token terms (a textbook form of this remainder is sketched after the minor comments below). In RLVR, sampled updates frequently produce probability ratios >2 or <0.5 on high-mass tokens; under these conditions the quadratic and higher-order contributions are not demonstrably negligible, which directly affects whether polarity reliably predicts the actual finite-difference entropy change.
  2. [§4.3, Table 2] The reported correlation between polarity and observed entropy change is given only for the linear regime; no ablation compares polarity-based predictions against the full finite-difference entropy change ΔH computed directly from the updated policy. This leaves open whether the structural asymmetry (frequent tokens contract, rare tokens expand) persists outside the first-order regime.
  3. [§5.1, Algorithm 1] PAPO's advantage reweighting and adaptive allocation between polarity branches are motivated by the first-order analysis, yet the method is evaluated only against global-entropy baselines. A direct comparison that disables the polarity-specific reweighting while keeping the adaptive phase signal would isolate whether the claimed benefit stems from the polarity mechanism or from the adaptive schedule alone.
minor comments (2)
  1. [§3] Notation for the polarity quantity is introduced in §3 but the sign convention (positive = expansion) is not restated in the experimental sections, making it easy to misread the polarity-branch plots.
  2. [Figure 3] Figure 3 caption does not specify the exact probability-ratio threshold used to color tokens as 'frequent' versus 'rare', which affects reproducibility of the asymmetry claim.
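To make the remainder concern in major comment 1 concrete, here is the textbook second-order form for per-position entropy; this is standard single-variable calculus applied coordinatewise, not a bound taken from the paper.

```latex
% Writing H(\pi) = \sum_v h(\pi_v) with h(p) = -p\log p and h''(p) = -1/p, and
% using \sum_v \Delta\pi_v = 0 to simplify the first-order term, the Lagrange
% form of the Taylor remainder gives, for some \xi_v between \pi_v and \pi'_v,
\Delta H \;=\; -\sum_{v} \Delta\pi_v \,\log \pi_v
          \;-\; \frac{1}{2}\sum_{v} \frac{(\Delta\pi_v)^2}{\xi_v},
\qquad \Delta\pi_v = \pi'_v - \pi_v
```

The correction is nonpositive (entropy is concave in each coordinate) and scales roughly with (Δπ_v)²/π_v, i.e. with squared relative probability changes, so it is precisely in the large-ratio regime the referee describes that it can overwhelm the first-order polarity term.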

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional theoretical discussion and empirical ablations where appropriate.

read point-by-point responses
  1. Referee: [§3.2, Eq. (4)] The first-order Taylor expansion ΔH ≈ ∑_t (∂H/∂π_t) · Δπ_t is introduced without an explicit bound on the remainder involving second derivatives of -p log p and cross-token terms. In RLVR, sampled updates frequently produce probability ratios >2 or <0.5 on high-mass tokens; under these conditions the quadratic and higher-order contributions are not demonstrably negligible, which directly affects whether polarity reliably predicts actual finite-difference entropy change.

    Authors: We acknowledge that the first-order approximation lacks an explicit remainder bound and that large probability shifts can introduce higher-order effects. In the revised manuscript, we have added a discussion in §3.2 on the validity conditions of the linearization, including a reference to the Hessian of the entropy function for the remainder term. We also report empirical checks confirming that polarity sign remains predictive of entropy change direction for the update magnitudes typical in our RLVR setting. This clarifies the approximation's scope while preserving the core theoretical claims. revision: partial

  2. Referee: [§4.3, Table 2] The reported correlation between polarity and observed entropy change is given only for the linear regime; no ablation compares polarity-based predictions against the full finite-difference entropy ΔH computed directly from the updated policy. This leaves open whether the structural asymmetry (frequent tokens contract, rare tokens expand) persists outside the first-order regime.

    Authors: We agree that direct comparison to the full finite-difference ΔH strengthens the evidence. In the revised version, Table 2 has been extended to report correlations using the exact entropy change ΔH = H(π') − H(π). The updated results confirm that the structural asymmetry persists, with high-probability tokens continuing to show net contraction tendencies outside the strict linear regime. revision: yes

  3. Referee: [§5.1, Algorithm 1] PAPO's advantage reweighting and adaptive allocation between polarity branches are motivated by the first-order analysis, yet the method is evaluated only against global-entropy baselines. A direct comparison that disables the polarity-specific reweighting while keeping the adaptive phase signal would isolate whether the claimed benefit stems from the polarity mechanism or from the adaptive schedule alone.

    Authors: We have added the requested ablation in the revised §5.1. We compare PAPO to a controlled variant that retains the adaptive phase signal (based on the empirical entropy trajectory) but replaces polarity-aware advantage reweighting with uniform weighting. The results show that the polarity-specific reweighting contributes additional gains in reward and training efficiency beyond the adaptive schedule alone, supporting the design of PAPO. revision: yes

Circularity Check

1 step flagged

Entropy polarity arises directly from the first-order Taylor expansion, rendering its predictive role definitional rather than independent

specific steps
  1. self-definitional [Abstract]
    "Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy."

    Polarity is introduced as the direct output of the first-order term in the entropy-change approximation. Its claimed ability to predict the sign and magnitude of actual entropy change is therefore equivalent to the linear approximation by construction; any deviation from the full finite-difference entropy is outside the derivation and not addressed within it.

full rationale

The paper's core theoretical step introduces a first-order approximation of entropy change under policy updates and defines entropy polarity as the resulting signed token-level quantity. This quantity is then said to predict expansion or contraction. Because polarity is constructed exactly as the linear term in the expansion, the predictive claim holds by the definition of the approximation itself. No self-citations, fitted parameters, or uniqueness theorems are invoked in the provided text to support the derivation. Empirical validation is presented separately and does not alter the definitional character of the theoretical step. This produces moderate circularity confined to the framing of the new quantity; the underlying calculus is standard and the remainder of the work (asymmetry observations, PAPO algorithm) retains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on standard RL assumptions plus the validity of the first-order approximation; entropy polarity is introduced as a derived quantity without independent external evidence.

axioms (1)
  • domain assumption Standard assumptions of RLVR setups for LLMs hold, including verifiable rewards and policy gradient updates.
    The entropy mechanics analysis builds directly on existing RLVR methodology.
invented entities (1)
  • entropy polarity no independent evidence
    purpose: Signed token-level quantity to predict entropy expansion or contraction from sampled updates.
    New derived measure introduced via the first-order approximation.

pith-pipeline@v0.9.0 · 5601 in / 1210 out tokens · 41649 ms · 2026-05-13T07:27:10.156712+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 13 internal anchors
