pith. machine review for the scientific record.

arxiv: 2605.11775 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links


Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

Aiden Adams, Chenxin An, Dingwei Zhu, Fei Huang, Han Li, Jiazheng Zhang, Junrui Shen, Long Ma, Qi Zhang, Shaofan Liu, Shichun Liu, Shihan Dou, Tao Gui, Wiggin Zhou, Xuanjing Huang, Yunbin Zhao, Yunke Zhang, Zhihao Zhang, Zhiheng Xi, Ziche Fu

Pith reviewed 2026-05-13 07:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords entropy polarity · policy entropy · reinforcement learning · LLM fine-tuning · RLVR · policy optimization · exploration control

The pith

A first-order approximation of entropy change produces entropy polarity, a signed token-level quantity that predicts whether a sampled policy update expands or contracts entropy in LLM reinforcement fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework for how policy updates alter entropy at the token level during reinforcement learning with verifiable rewards. It derives entropy polarity as a signed quantity from a first-order approximation, showing that the sign indicates expansion or contraction and exposing an asymmetry where high-probability tokens favor contraction while expansion requires lower-probability samples. This token-level view enables a method that preserves both polarity branches and reweights advantages using the observed entropy trajectory as a signal. This matters because direct, local control could replace global entropy objectives, balancing exploration and exploitation more precisely during training.
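For orientation, here is the standard first-order relation behind such a quantity, written for a single token position with next-token distribution π; the notation is illustrative and the paper's exact polarity definition may differ.

```latex
% Entropy of one next-token distribution (illustrative notation, not the paper's):
H(\pi) \;=\; -\sum_{v} \pi(v)\,\log \pi(v)

% First-order change under a small shift \Delta\pi with \sum_v \Delta\pi(v) = 0.
% The "+1" from d(-p log p)/dp = -(\log p + 1) cancels, and because any constant
% can be subtracted inside the sum, the change can be measured against the mean
% log-probability -H(\pi):
\Delta H \;\approx\; -\sum_{v} \Delta\pi(v)\,\log \pi(v)
         \;=\; -\sum_{v} \Delta\pi(v)\,\bigl(\log \pi(v) + H(\pi)\bigr)
```

Read this way, the asymmetry in the core claim is already visible at first order: an update that moves probability mass onto tokens that are more likely than average (log π(v) > −H(π)) contracts entropy, while mass moved onto below-average tokens expands it.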

Core claim

Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, entropy polarity reliably predicts entropy changes, and positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, Polarity-Aware Policy Optimization (PAPO) preserves both polarity branches and implements entropy control through advantage reweighting.

What carries the argument

entropy polarity, a signed token-level quantity obtained from the first-order approximation of entropy change

Load-bearing premise

The first-order approximation of entropy change accurately captures the token-level mechanics and structural asymmetry without higher-order effects dominating.

What would settle it

Collect token-level entropy changes on a held-out set of updates and check whether the measured sign and magnitude match the predicted polarity values within a small error bound; systematic mismatches at moderate step sizes would falsify the approximation.
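A minimal sketch of that check, assuming access to per-position token distributions before and after an update; the tensor names, shapes, and the simulated update are illustrative, not from the paper.

```python
import torch

def entropy(logits):
    # Shannon entropy (nats) of each categorical distribution along the last axis.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def first_order_dH(logits_before, logits_after):
    # Predicted entropy change from the first-order term, evaluated at the old
    # policy: dH ≈ -sum_v dpi(v) * log pi_old(v).
    p_old = torch.softmax(logits_before, dim=-1)
    p_new = torch.softmax(logits_after, dim=-1)
    dpi = p_new - p_old
    return -(dpi * torch.log(p_old.clamp_min(1e-12))).sum(dim=-1)

def polarity_vs_exact(logits_before, logits_after):
    # Exact finite-difference entropy change vs. the first-order prediction:
    # sign agreement tests the polarity claim, absolute error its magnitude.
    pred = first_order_dH(logits_before, logits_after)
    exact = entropy(logits_after) - entropy(logits_before)
    sign_agreement = (pred.sign() == exact.sign()).float().mean().item()
    mean_abs_error = (pred - exact).abs().mean().item()
    return sign_agreement, mean_abs_error

# Toy stand-in for a held-out batch of per-token logits before/after one update.
torch.manual_seed(0)
before = torch.randn(256, 4096)
after = before + 0.05 * torch.randn_like(before)  # small simulated update
print(polarity_vs_exact(before, after))           # signs should mostly agree at small steps
```

Systematic sign disagreement at the step sizes actually used in training, rather than only at artificially large ones, is what would falsify the approximation.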

read the original abstract

Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
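The abstract names PAPO but gives no formula for its reweighting. Purely as a sketch of where the pieces could fit, the snippet below combines a crude polarity proxy with a clipped-linear phase gate driven by the entropy trajectory; the proxy, the gate, the target entropy, and every identifier are assumptions, not the paper's Algorithm 1.

```python
import torch

def polarity_proxy(logp_sampled, position_entropy, advantage):
    # Crude sign proxy for token-level polarity: with positive advantage,
    # reinforcing a token whose log-probability sits above the distribution's
    # mean log-prob (-H) tends to contract entropy (negative value), while a
    # below-average token tends to expand it (positive value). Not the paper's
    # exact quantity.
    return -advantage * (logp_sampled + position_entropy)

def reweight_advantages(advantages, logp_sampled, position_entropy,
                        running_entropy, target_entropy, gain=1.0):
    # Keep both polarity branches but shift optimization pressure using the
    # empirical entropy trajectory as a phase signal: when running entropy is
    # below target, upweight entropy-expanding tokens; above target, upweight
    # entropy-contracting ones. The clipped-linear gate is an illustrative choice.
    pol = polarity_proxy(logp_sampled, position_entropy, advantages)
    phase = max(-1.0, min(1.0, gain * (target_entropy - running_entropy)))
    weights = 1.0 + phase * pol.sign()
    return advantages * weights

# Toy usage on per-token quantities from one rollout batch.
torch.manual_seed(0)
adv = torch.randn(1024)              # per-token advantages
logp = -5.0 * torch.rand(1024)       # log-probs of the sampled tokens
H = 3.0 * torch.rand(1024)           # per-position policy entropies
new_adv = reweight_advantages(adv, logp, H, running_entropy=1.2, target_entropy=1.5)
```

The gate only shows where an online phase signal could enter; how the paper actually allocates pressure between branches is not recoverable from the abstract alone.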

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a theoretical framework for entropy mechanics in RLVR for LLMs. It derives a first-order approximation of token-level entropy change that yields entropy polarity, a signed quantity predicting whether a sampled update expands or contracts entropy. The analysis identifies a structural asymmetry (frequent high-probability tokens contract entropy; lower-probability samples expand it) and shows that positive and negative polarity branches play complementary roles. Building on this, the authors propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches via advantage reweighting and adaptively reallocates optimization pressure using the empirical entropy trajectory as a phase signal. Experiments on mathematical reasoning and agentic benchmarks report that PAPO outperforms competitive baselines with improved training efficiency and reward gains.

Significance. If the first-order approximation holds and the empirical predictions are robust, the work would provide a token-level mechanism for entropy control that is more granular than global regularization approaches. PAPO's adaptive reweighting and the reported gains on reasoning benchmarks could influence how exploration is managed in LLM fine-tuning pipelines.

major comments (3)
  1. [§3.2, Eq. (4)] The first-order Taylor expansion ΔH ≈ ∑_t (∂H/∂π_t) · Δπ_t is introduced without an explicit bound on the remainder involving second derivatives of -p log p and cross-token terms (a textbook form of this remainder is sketched after the minor comments below). In RLVR, sampled updates frequently produce probability ratios >2 or <0.5 on high-mass tokens; under these conditions the quadratic and higher-order contributions are not demonstrably negligible, which directly affects whether polarity reliably predicts the actual finite-difference entropy change.
  2. [§4.3, Table 2] The reported correlation between polarity and observed entropy change is given only for the linear regime; no ablation compares polarity-based predictions against the full finite-difference entropy change ΔH computed directly from the updated policy. This leaves open whether the structural asymmetry (frequent tokens contract, rare tokens expand) persists outside the first-order regime.
  3. [§5.1, Algorithm 1] PAPO's advantage reweighting and adaptive allocation between polarity branches are motivated by the first-order analysis, yet the method is evaluated only against global-entropy baselines. A direct comparison that disables the polarity-specific reweighting while keeping the adaptive phase signal would isolate whether the claimed benefit stems from the polarity mechanism or from the adaptive schedule alone.
minor comments (2)
  1. [§3] Notation for the polarity quantity is introduced in §3 but the sign convention (positive = expansion) is not restated in the experimental sections, making it easy to misread the polarity-branch plots.
  2. [Figure 3] Figure 3 caption does not specify the exact probability-ratio threshold used to color tokens as 'frequent' versus 'rare', which affects reproducibility of the asymmetry claim.
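To make the remainder concern in major comment 1 concrete, here is the textbook second-order form for per-position entropy; this is standard single-variable calculus applied coordinatewise, not a bound taken from the paper.

```latex
% Writing H(\pi) = \sum_v h(\pi_v) with h(p) = -p\log p and h''(p) = -1/p, and
% using \sum_v \Delta\pi_v = 0 to simplify the first-order term, the Lagrange
% form of the Taylor remainder gives, for some \xi_v between \pi_v and \pi'_v,
\Delta H \;=\; -\sum_{v} \Delta\pi_v \,\log \pi_v
          \;-\; \frac{1}{2}\sum_{v} \frac{(\Delta\pi_v)^2}{\xi_v},
\qquad \Delta\pi_v = \pi'_v - \pi_v
```

The correction is nonpositive (entropy is concave in each coordinate) and scales roughly with (Δπ_v)²/π_v, i.e. with squared relative probability changes, so it is precisely in the large-ratio regime the referee describes that it can overwhelm the first-order polarity term.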

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional theoretical discussion and empirical ablations where appropriate.

read point-by-point responses
  1. Referee: [§3.2, Eq. (4)] The first-order Taylor expansion ΔH ≈ ∑_t (∂H/∂π_t) · Δπ_t is introduced without an explicit bound on the remainder involving second derivatives of -p log p and cross-token terms. In RLVR, sampled updates frequently produce probability ratios >2 or <0.5 on high-mass tokens; under these conditions the quadratic and higher-order contributions are not demonstrably negligible, which directly affects whether polarity reliably predicts actual finite-difference entropy change.

    Authors: We acknowledge that the first-order approximation lacks an explicit remainder bound and that large probability shifts can introduce higher-order effects. In the revised manuscript, we have added a discussion in §3.2 on the validity conditions of the linearization, including a reference to the Hessian of the entropy function for the remainder term. We also report empirical checks confirming that polarity sign remains predictive of entropy change direction for the update magnitudes typical in our RLVR setting. This clarifies the approximation's scope while preserving the core theoretical claims. revision: partial

  2. Referee: [§4.3, Table 2] The reported correlation between polarity and observed entropy change is given only for the linear regime; no ablation compares polarity-based predictions against the full finite-difference entropy ΔH computed directly from the updated policy. This leaves open whether the structural asymmetry (frequent tokens contract, rare tokens expand) persists outside the first-order regime.

    Authors: We agree that direct comparison to the full finite-difference ΔH strengthens the evidence. In the revised version, Table 2 has been extended to report correlations using the exact entropy change ΔH = H(π') − H(π). The updated results confirm that the structural asymmetry persists, with high-probability tokens continuing to show net contraction tendencies outside the strict linear regime. revision: yes

  3. Referee: [§5.1, Algorithm 1] PAPO's advantage reweighting and adaptive allocation between polarity branches are motivated by the first-order analysis, yet the method is evaluated only against global-entropy baselines. A direct comparison that disables the polarity-specific reweighting while keeping the adaptive phase signal would isolate whether the claimed benefit stems from the polarity mechanism or from the adaptive schedule alone.

    Authors: We have added the requested ablation in the revised §5.1. We compare PAPO to a controlled variant that retains the adaptive phase signal (based on the empirical entropy trajectory) but replaces polarity-aware advantage reweighting with uniform weighting. The results show that the polarity-specific reweighting contributes additional gains in reward and training efficiency beyond the adaptive schedule alone, supporting the design of PAPO. revision: yes

Circularity Check

1 step flagged

Entropy polarity arises directly from the first-order Taylor expansion, rendering its predictive role definitional rather than independent

specific steps
  1. self-definitional [Abstract]
    "Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy."

    Polarity is introduced as the direct output of the first-order term in the entropy-change approximation. Its claimed ability to predict the sign and magnitude of actual entropy change is therefore equivalent to the linear approximation by construction; any deviation from the full finite-difference entropy is outside the derivation and not addressed within it.

full rationale

The paper's core theoretical step introduces a first-order approximation of entropy change under policy updates and defines entropy polarity as the resulting signed token-level quantity. This quantity is then said to predict expansion or contraction. Because polarity is constructed exactly as the linear term in the expansion, the predictive claim holds by the definition of the approximation itself. No self-citations, fitted parameters, or uniqueness theorems are invoked in the provided text to support the derivation. Empirical validation is presented separately and does not alter the definitional character of the theoretical step. This produces moderate circularity confined to the framing of the new quantity; the underlying calculus is standard and the remainder of the work (asymmetry observations, PAPO algorithm) retains independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on standard RL assumptions plus the validity of the first-order approximation; entropy polarity is introduced as a derived quantity without independent external evidence.

axioms (1)
  • domain assumption Standard assumptions of RLVR setups for LLMs hold, including verifiable rewards and policy gradient updates.
    The entropy mechanics analysis builds directly on existing RLVR methodology.
invented entities (1)
  • entropy polarity no independent evidence
    purpose: Signed token-level quantity to predict entropy expansion or contraction from sampled updates.
    New derived measure introduced via the first-order approximation.

pith-pipeline@v0.9.0 · 5601 in / 1210 out tokens · 41649 ms · 2026-05-13T07:27:10.156712+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 13 internal anchors
