EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Bo Yuan; Gen Li; Jianwei Lv; Junfeng Wang; Luning Wang; Siyu Chen; Xiandong Li; Yujin Wang; Yunhao Qiao; Yunsheng Zeng; Yuwei Miao

arxiv: 2605.27846 · v1 · pith:3W4NRXBBnew · submitted 2026-05-27 · 💻 cs.AI

Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

Pith reviewed 2026-06-29 12:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords reinforcement learning · policy optimization · open-ended QA · entropy adaptation · positive-negative weighting · medical question answering · response diversity · entropy collapse


The pith

EAPO uses the ratio of current to initial policy entropy to adaptively weight positive samples, improving diversity and stability over fixed-weight methods in open-ended QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how positive and negative samples function differently in reinforcement learning from verifiable rewards (RLVR) for open-ended question answering. It reports that negative samples mainly drive response diversity and set the performance ceiling, while positive samples shape answer quality and training stability. From these observations the authors derive EAPO, which scales the contribution of positive samples by the ratio of current policy entropy to initial entropy: the weight drops when entropy falls, to keep exploration alive, and rises when entropy climbs, to anchor convergence. Experiments on two public medical QA datasets show EAPO yields higher diversity and more stable training than constant-weight baselines.

Core claim

Negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. EAPO therefore computes an adaptive coefficient for positive samples equal to the ratio of current policy entropy to initial entropy; this coefficient is lowered during the entropy-decreasing phase to preserve exploration and raised during the entropy-increasing phase to reinforce stability, thereby mitigating entropy collapse.
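Read literally, the coefficient described above can be written as follows. This is a sketch from the summary alone: the symbols $w^{+}_{t}$, $H$, $\theta_t$, $A_i$, and $\tilde{A}_i$ are notation assumed here, and the paper may add clipping, smoothing, or a different normalization that this page does not describe.

```latex
w^{+}_{t} = \frac{H(\pi_{\theta_t})}{H(\pi_{\theta_0})},
\qquad
\tilde{A}_{i} =
\begin{cases}
  w^{+}_{t}\, A_{i}, & \text{sample } i \text{ is positive} \\
  A_{i},             & \text{sample } i \text{ is negative}
\end{cases}
```

Under this reading, $w^{+}_{t} < 1$ whenever entropy has dropped below its starting value, so positive samples pull less on the policy, and $w^{+}_{t} > 1$ once entropy exceeds it.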

What carries the argument

The entropy-ratio coefficient (current policy entropy divided by initial entropy) that dynamically scales the weight given to positive samples inside the policy-gradient objective.

If this is right

Response diversity is maintained by deliberately lowering positive-sample weight whenever entropy begins to fall.
Training stability improves when positive-sample weight is increased during periods of rising entropy.
The adaptive scheme prevents the entropy collapse that fixed positive-negative weights commonly produce in open-ended QA.
The performance upper bound set by negative samples is reached more reliably because exploration is preserved longer.
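The weight movements listed above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names, the use of mean per-token Shannon entropy as the "policy entropy", and the toy distributions are all assumptions made here.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over a batch of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def positive_weight(current_entropy: float, initial_entropy: float) -> float:
    """Entropy-ratio coefficient: current entropy divided by initial entropy."""
    if initial_entropy <= 0.0:
        raise ValueError("initial entropy must be positive")
    return current_entropy / initial_entropy


def reweight_advantages(advantages: Sequence[float], w_pos: float) -> List[float]:
    """Scale only positive advantages by w_pos; negative ones are unchanged."""
    return [a * w_pos if a > 0.0 else a for a in advantages]


# Hypothetical toy distributions, for illustration only.
initial = mean_policy_entropy([[0.25, 0.25, 0.25, 0.25]])  # uniform over 4 tokens
current = mean_policy_entropy([[0.5, 0.5, 0.0, 0.0]])      # mass on 2 tokens
w = positive_weight(current, initial)                      # ln 2 / ln 4 = 0.5
scaled = reweight_advantages([1.0, -1.0, 0.4], w)
```

In this toy case entropy has halved, so positive advantages are halved while the negative advantage keeps its full magnitude, which is the exploration-preserving behavior the list describes.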

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy-ratio rule could be tested on non-medical open-ended tasks such as creative writing or code generation to check whether the positive-negative division of labor holds outside medicine.
Combining the EAPO coefficient with existing entropy-regularization terms in PPO-style algorithms might control collapse without introducing extra hyperparameters.
If the entropy trajectory is measured on a held-out validation set rather than the training batch, the method could be made more robust to noisy reward signals.

Load-bearing premise

Negative samples control diversity and the performance ceiling while positive samples control quality and stability.

What would settle it

Running the same medical QA experiments with the identical reward function and finding that EAPO produces no measurable gain in response diversity or training stability relative to the fixed-weight baselines would falsify the central claim.
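Such a comparison needs a concrete diversity measure. This page does not say which metric the paper uses, so the sketch below substitutes distinct-n (the fraction of unique n-grams across a set of sampled responses) purely as a stand-in; the function name and whitespace tokenization are assumptions.

```python
from typing import Sequence


def distinct_n(responses: Sequence[str], n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams in a set of responses."""
    total = 0
    unique = set()
    for text in responses:
        tokens = text.split()  # naive whitespace tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Computing the same score on responses from EAPO and from each fixed-weight baseline, for the same prompts and sampling settings, would give one number per method to compare.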

Figures

Figures reproduced from arXiv: 2605.27846 by Bo Yuan, Gen Li, Jianwei Lv, Junfeng Wang, Luning Wang, Siyu Chen, Xiandong Li, Yujin Wang, Yunhao Qiao, Yunsheng Zeng, Yuwei Miao.

**Figure 1.** Training dynamics on the RJUA dataset. … probability distribution and preserving exploration diversity. This observation is consistent with the findings of W-REINFORCE (Zhu et al., 2026) and A3PO (Tang et al., 2025) under the RLVR setting. Positive sample rewards improve slowly, while negative sample rewards improve quickly. As shown in …

**Figure 2.** Training dynamics under asymmetric weighting on the RJUA dataset. The corresponding results on the …

**Figure 3.** Training dynamics of EAPO against various baseline methods on the RJUA dataset.

**Figure 4.** Training dynamics under asymmetric weighting on the CMD dataset.

Original abstract

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EAPO adds an entropy-ratio adaptive weight on positive samples to RLVR for open-ended QA and reports gains in diversity and stability on medical datasets.


The main thing here is that EAPO uses the ratio of current policy entropy to initial entropy to scale the weight on positive samples during training. When entropy is falling the weight drops to keep exploration alive; when it rises the weight increases to steady the run. This is presented as a fix for entropy collapse in open-ended settings where fixed positive-negative weights fall short.

The authors first split samples by reward mean and report that negative samples mostly set diversity and the performance ceiling, while positive samples set quality and convergence stability. They then build the adaptive rule on top of that split. The experiments run on two public medical QA datasets and claim consistent improvement over fixed-weight baselines in both diversity and stability.
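A minimal sketch of a reward-mean split, assuming rewards are compared against the mean reward of the responses sampled for the same prompt. The grouping, the strict inequality, and the function name are assumptions; this page does not say which mean the paper uses or how it handles ties.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def split_by_reward_mean(rewards: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Return (positive_indices, negative_indices) for one prompt's responses.

    Assumption: a response is positive when its reward is strictly above the
    group mean; ties fall into the negative set.
    """
    threshold = mean(rewards)
    positive = [i for i, r in enumerate(rewards) if r > threshold]
    negative = [i for i, r in enumerate(rewards) if r <= threshold]
    return positive, negative


# Hypothetical judge scores for four sampled responses to one question.
rewards = [0.9, 0.4, 0.7, 0.2]
pos_idx, neg_idx = split_by_reward_mean(rewards)
```

Because the threshold moves with each group's rewards, the same absolute score can be positive for one prompt and negative for another, which is one reason the split may be sensitive to the reward model.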

The adaptive rule itself is the concrete new piece. It is a simple, implementable heuristic that directly ties weight to a measurable training statistic. The motivation from the positive-negative split is reasonable for open-ended QA and the choice of medical datasets gives a clear test bed.

The soft spots are the missing numbers. No effect sizes, error bars, or ablation tables appear in the description, so the size and reliability of the gains are hard to judge. The reward-mean split for labeling samples could be sensitive to the particular reward model, and nothing is shown about behavior outside medical QA. The entropy phases themselves depend on the training trajectory, which may limit how portable the rule is.

This is for people already running RLVR on open-ended generation tasks who want a lightweight way to protect diversity. A reader who cares about practical stability tweaks in applied domains would get something usable from the heuristic and the reported comparisons.

It deserves peer review. The method is explicit enough to reproduce and the experiments are on public data, so referees can check the claims directly.

Referee Report

0 major / 1 minor

Summary. The paper claims that in RLVR for open-ended QA, negative samples predominantly control response diversity and the performance upper bound while positive samples control response quality and convergence stability. It introduces a reward-mean strategy to label samples and proposes EAPO, which adaptively sets positive-sample weights from the ratio of current policy entropy to initial entropy (reducing the weight during entropy decrease to preserve exploration and increasing it during entropy increase to reinforce stability). Experiments on two public open-ended medical QA datasets are reported to show that EAPO substantially outperforms fixed-weight baselines on both diversity and stability metrics.

Significance. If the empirical results hold, EAPO supplies a concrete, falsifiable heuristic for adaptive weighting that directly addresses the failure of fixed-weight RLVR on open-ended tasks. The work is grounded in an explicit observation about differential sample roles and includes reproducible experiments on public datasets, which strengthens its contribution to training stability and diversity in reasoning models for domains such as medical QA.

minor comments (1)

[Abstract] The performance claims are stated qualitatively ('consistently and substantially outperforms') without numerical deltas, error bars, or dataset names. Adding these would make the claims easier to assess; this is a presentation issue only.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the clear summary of our contributions, and the recommendation for minor revision. We are pleased that the empirical grounding on public datasets and the explicit observations on positive/negative sample roles were viewed favorably.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's claims rest on empirical observations from RL training runs and a concrete heuristic (entropy-ratio weighting) whose validity is tested directly via experiments on two public datasets. No equations reduce a prediction to a fitted input by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the adaptive rule is presented as an externally falsifiable design choice rather than a self-referential definition. The central result (outperformance in diversity and stability) is therefore independent of the method's internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed beyond standard RL assumptions; the entropy ratio serves as the core mechanism but its implementation thresholds are unspecified.

axioms (1)

domain assumption: The reward-mean-based strategy reliably distinguishes positive from negative samples in open-ended QA.
The abstract states this as the basis for its observations on sample roles.

pith-pipeline@v0.9.1-grok · 5752 in / 1236 out tokens · 32573 ms · 2026-06-29T12:55:24.943456+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 20 canonical work pages · 13 internal anchors

[1]

Asymmetric reinforce for off-policy reinforcement learning: Balancing positive and negative rewards. Advances in Neural Information Processing Systems, 38:9640–9664. Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. 2025a. Exploration vs exploitation: Rethinking rlvr through clipping, entropy, and spurious reward. arXiv preprint arXi...

[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine

[3]

On the direction of rlvr updates for llm reasoning: Identification and exploitation. arXiv preprint arXiv:2603.22117. Yash Ingle, Jaival Chauhan, Ankit Yadav, and Sudhakar Mishra

[4]

Adaptive negative reinforcement for llm reasoning: Dynamically balancing correction and diversity in rlvr. arXiv preprint arXiv:2605.07137. Shiwei Lyu, Chenfei Chi, Hongbo Cai, Lei Shi, Xiaoyan Yang, Lei Liu, Xiang Chen, Deng Zhao, Zhiqiang Zhang, Xianguo Lyu, Ming Zhang, Fangzhou Li, Xiaowei Ma, Yue Shen, Jinjie Gu, Wei Xue, and Yiran Huang

[5]

Rjua-qa: A comprehensive qa dataset for urology. Preprint, arXiv:2312.09785. Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar

[6]

Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. In International Conference on Learning Representations, volume 2025, pages 94743–94765. Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, and 1 others

[7]

Humanity’s last exam. arXiv preprint arXiv:2501.14249. Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, and 25 others

[8]

Qwen2.5 Technical Report

Qwen2.5 technical report. Preprint, arXiv:2412.15115. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

[9]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others

[10]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, and Jun Zhou

[11]

Rethinking sample polarity in reinforcement learning with verifiable rewards. arXiv preprint arXiv:2512.21625. Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, and 1 others.

[12]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599. Qwen Team

[13]

Qwen3 Technical Report

Qwen3 technical report. Preprint, arXiv:2505.09388. Xueyun Tian, Minghua Ma, Bingbing Xu, Nuoyan Lyu, Wei Li, Heng Dong, Zheng Chu, Yuanzhuo Wang, and Huawei Shen

[14]

Learning from mistakes: Negative reasoning samples enhance out-of-domain generalization. arXiv preprint arXiv:2601.04992. Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai

[15]

When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL

Aspo: Asymmetric importance sampling policy optimization. arXiv preprint arXiv:2510.06062. Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, and 1 others

[16]

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Unlocking exploration in rlvr: Uncertainty-aware advantage shaping for deeper reasoning. arXiv preprint arXiv:2510.10649. Huimin Xu, Shuai Zhao, Xiaobao Wu, and Anh Tuan Luu. 2026a. Understanding and preventing entropy collapse in rlvr with on-policy entropy flow optimization. arXiv preprint arXiv:2605.11491. Ming Xu

[17]

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

Medical dataset (cmd). https://github.com/Toyhom/Chinese-medical-dialogue-data. Yifan Xu, Junren Chen, and Yifan Chen. 2026b. How you begin is how you reason: Driving exploration in rlvr via prefix-tuned priors. arXiv preprint arXiv:2605.08817. Yibo Yan, Shen Wang, Jiahao Huo, Jingheng Ye, Zhendong Chu, Xuming Hu, Philip S Yu, Carla Gomes, Bart Selman, ...

[18]

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

Position: Multimodal large language models can significantly advance scientific reasoning. arXiv preprint arXiv:2502.02871. Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, and 1 others

[19]

Code to think, think to code: A survey on code-enhanced reasoning and reasoning-driven code intelligence in llms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2586–2616. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others

[20]

Group Sequence Policy Optimization

Group sequence policy optimization. arXiv preprint arXiv:2507.18071. Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng

[21]

Provide the diagnostic conclusion together with recommendations for further diagnostic workup or treatment within the <advice> and </advice> tags. Patient consultation: {question} A.2 Evaluation Dimensions on LLM-as-a-Judge We adopt an LLM-as-a-Judge paradigm to score the model’s reasoning content in <think> tag during reinforcement training and the correspond...

[22]

operating regime

W-REINFORCE adopts a fixed positive-sample weight w+ that remains unchanged throughout training; consequently, its final performance is highly sensitive to this hyperparameter: varying w+ alone lifts Rouge-L on RJUA from 0.263 to 0.337 and Reranker on RJUA from 0.903 to 0.983, indicating that an improper choice of w+ can lead to a substantial degradation ...
