arxiv: 2605.07579 · v2 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Jeonghoon Shim, Jongwon Lim, Minjae Oh, Woojin Ahn, Yohan Jo, Yunho Choi

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · value estimation · internal states · policy optimization · large language models · reasoning models · baseline estimation · RLVR

The pith

A policy model's hidden states can estimate value baselines for reinforcement learning without extra models or multiple samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that value estimation for RL with verifiable rewards can be done using only the actor's internal hidden states and entropy statistics, read by a lightweight probe. The probe is trained online alongside the policy, and a cross-rollout design (each rollout's baseline is predicted from a different rollout of the same prompt) is meant to keep the policy gradient unbiased. Because each value estimate needs only a single rollout rather than a group mean over many, training can cover more diverse prompts within the same compute budget and avoids the sampling overhead of detecting zero-advantage prompts. If correct, this would simplify and speed up fine-tuning of large reasoning models on tasks with verifiable outcomes.

Core claim

POISE obtains a baseline by predicting expected verifiable reward from the hidden states of the prompt and generated trajectory along with token-entropy statistics using a lightweight probe trained online alongside the policy. To keep gradients unbiased, each rollout's value is predicted from an independent rollout's internal states. This matches the performance of DAPO on math reasoning benchmarks for models like Qwen3-4B while using less compute, and the estimator performs comparably to a full-scale value model.
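The cross-rollout construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes two rollouts per prompt, a linear probe with a sigmoid output (so predictions lie in [0, 1] like a binary verifiable reward), and the hypothetical names `cross_rollout_advantages`, `phi`, and `w`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_rollout_advantages(phi, rewards, w):
    """Advantages for two rollouts of the same prompt.

    phi     : (2, d) probe features, one row per rollout (hypothetical layout)
    rewards : (2,)   verifiable rewards of the two rollouts
    w       : (d,)   weights of an assumed linear probe
    """
    values = sigmoid(phi @ w)   # values[k] is predicted from rollout k's features
    baselines = values[::-1]    # swap: rollout 0 uses rollout 1's prediction and vice versa
    return rewards - baselines

# Toy inputs, made up for illustration only.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0]])
rewards = np.array([1.0, 0.0])
w = np.array([2.0, -2.0])
adv = cross_rollout_advantages(phi, rewards, w)
print(adv)
```

The only design point the sketch tries to capture is the swap: the baseline subtracted from a rollout's reward never reads that rollout's own features.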

What carries the argument

The lightweight probe that predicts value from hidden states and entropy, together with the cross-rollout construction for unbiased estimation.
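One way to picture the probe's inputs is the sketch below. The page does not specify how hidden states are pooled or which entropy statistics are used, so mean pooling and the (mean, max, std) summary are assumptions, and `token_entropy` and `probe_features` are hypothetical helper names.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits : (T, V) array of pre-softmax scores for T tokens over a vocabulary of size V
    """
    z = logits - logits.max(axis=-1, keepdims=True)            # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log-softmax
    return -(np.exp(logp) * logp).sum(axis=-1)

def probe_features(hidden, logits):
    """Concatenate pooled hidden states with entropy summary statistics.

    hidden : (T, d) hidden states for the T tokens of prompt plus trajectory
    Pooling and statistics are assumptions, not the paper's exact recipe.
    """
    ent = token_entropy(logits)
    return np.concatenate([hidden.mean(axis=0), [ent.mean(), ent.max(), ent.std()]])

# Uniform logits over 4 tokens give the maximum entropy, log(4), at every position.
feats = probe_features(np.ones((2, 3)), np.zeros((2, 4)))
print(feats.shape)
```

Both inputs are already available from the policy's forward pass during generation, which is why the abstract describes the baseline as coming at negligible extra cost.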

If this is right

Training can use higher prompt diversity at fixed compute.
Gradient variance is reduced, leading to more stable learning.
No extra sampling needed to detect zero-advantage prompts.
The value estimator generalizes across verifiable tasks.
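The zero-advantage case mentioned in the list above is easy to see with a group-mean baseline. The sketch below uses plain mean-centering as a simplified stand-in for group-based methods (it omits any standard-deviation normalisation), and `group_mean_advantages` is a hypothetical helper name.

```python
import numpy as np

def group_mean_advantages(rewards):
    """Advantage of each rollout relative to the mean reward of its group."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Every rollout solved the prompt: all advantages vanish, so the group
# contributes no gradient even though all its rollouts were paid for.
solved = group_mean_advantages([1, 1, 1, 1])

# Mixed outcomes: the group carries a learning signal.
mixed = group_mean_advantages([1, 0, 0, 1])

print(solved, mixed)
```

With a group baseline, finding out that a prompt is uninformative requires sampling the whole group first; a baseline predicted from internal states does not need that extra sampling, which is the saving the list refers to.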

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might extend to non-verifiable reward settings if internal states encode useful signals.
It could reduce the overall parameter count needed for RL training by eliminating separate critics.
Future work might explore what specific aspects of hidden states carry the most value information.

Load-bearing premise

That training a small probe on the actor's hidden states and entropy produces value estimates accurate enough to reduce variance without biasing the policy updates.

What would settle it

Observing that the POISE-trained policy achieves lower accuracy or higher variance than DAPO on the same math benchmarks when both use equivalent total compute and rollouts.

Figures

Figures reproduced from arXiv: 2605.07579 by Jeonghoon Shim, Jongwon Lim, Minjae Oh, Woojin Ahn, Yohan Jo, Yunho Choi.

**Figure 2.** Overview of POISE. Left: Probe features ϕ(x, y, π) combine hidden states with token entropy. Right: The value estimator predicts each rollout’s baseline from the other rollout’s features.

**Figure 3.** Comparison of training dynamics between POISE and DAPO on Deepseek-Distill-Qwen …

**Figure 5.** Comparison between our value estimator and a critic model in online settings. Our estimator remains well aligned with the evolving policy while using substantially less computation. For full results, refer to § D.2.

**Figure 6.** The green line reports the online MAE …

**Figure 7.** Training dynamics of POISE on Qwen3-4B. The reward and predicted value increase …

**Figure 8.** Training dynamics of POISE on DeepSeek-R1-Distill-Qwen-1.5B. Although the smaller …

**Figure 9.** Online value prediction for Qwen3-4B. The target at each checkpoint is the empirical …

**Figure 10.** Online value prediction for DeepSeek-R1-Distill-Qwen-1.5B. The target at each check …

Original abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling costs for detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

POISE gets value estimates from the policy's own hidden states plus a cross-rollout trick, which looks practically useful for cutting critic overhead in RLVR but leaves the unbiasedness claim open to the shared-prompt correlation issue.

The core move is to replace a separate value model or group averaging with a small probe that reads the actor's hidden states and token entropy during the normal forward pass. They train that probe online and use cross-rollout so the baseline for one trajectory comes from the internals of a different rollout on the same prompt. On Qwen3-4B and the 1.5B DeepSeek distill, this matches DAPO on math benchmarks while using fewer samples per prompt and no extra model forward passes for the critic. The probe also performs close to a separate LLM-scale value model and works across other verifiable tasks. That is the practical win: higher prompt diversity inside a fixed compute budget and lower variance from single-rollout baselines. The experiments appear to show stable training and no obvious degradation, which is worth noting.

The soft spot is exactly the one the stress test flags. Because every rollout for a prompt shares the same early-layer activations, the probe inputs are not independent even after cross-rollout. If the probe learns any prompt-level signal, the baseline can still covary with the policy's own trajectory in a way that violates the zero-covariance condition for unbiased policy gradients. The paper introduces the cross-rollout construction to break dependence, but without seeing the explicit derivation or ablations that isolate prompt versus trajectory features, it is not clear the fix is complete. Online joint training adds another risk: the probe may exploit whatever correlation is easiest to fit. This is a real but fixable concern rather than a fatal flaw.

The work is aimed at groups already running RLVR on reasoning models and looking for cheaper baselines. A reader who cares about scaling laws or prompt diversity will get immediate value from the efficiency numbers.

I would send it to peer review because the empirical results are concrete and the method is simple enough to test quickly; referees can check the bias math and ask for the missing ablations on shared embeddings.

Referee Report

2 major / 2 minor

Summary. The paper introduces POISE (Policy Optimization with Internal State Value Estimation) for RLVR in large reasoning models. It trains a lightweight probe online on the policy model's hidden states (from prompt and trajectory) plus token-entropy statistics to produce value baselines for advantage estimation. A cross-rollout construction is used to predict each rollout's value from an independent rollout's internal states, intended to preserve unbiased policy gradients despite trajectory-conditioned features. The method claims to match DAPO performance on math reasoning benchmarks with Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B while using less compute, enabling higher prompt diversity, and that the probe performs comparably to a separate LLM-scale value model while generalizing across verifiable tasks.

Significance. If the unbiasedness of the cross-rollout gradients and the accuracy of the lightweight internal-state probe both hold, POISE would offer a low-overhead alternative to PPO-style critics or multi-rollout group baselines, reducing compute per prompt and supporting more diverse training batches in RLVR. This could meaningfully improve efficiency for reasoning model post-training without sacrificing stability.

major comments (2)

[cross-rollout construction] Cross-rollout construction (method description): the claim that predicting value for rollout i from rollout j's (j≠i) hidden states and entropy fully breaks trajectory-specific dependence is not obviously true. All rollouts for a given prompt share identical prompt tokens and therefore identical early-layer hidden states; any prompt-level signal captured by the probe can still correlate the baseline with the policy's own trajectory, violating the zero-covariance requirement for unbiased policy gradients. A formal argument or covariance measurement under this construction is required.
[value estimator training] Value estimator training (online probe section): the probe is trained jointly with the policy on the same rollouts. Even with cross-rollout prediction, the online fitting on trajectory-conditioned features risks the probe learning spurious correlations rather than true expected reward, especially since no ablation isolating the probe's contribution or measuring value prediction error against ground-truth returns is referenced in the provided claims.
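The covariance question in the first major comment can be illustrated on a toy problem. The sketch below is not from the paper: it assumes a hypothetical three-action softmax policy for a single fixed prompt, a made-up reward vector, and a made-up trajectory-conditioned baseline `b`, and it enumerates expectations exactly instead of sampling.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.2, -0.1, 0.5])   # toy policy logits for one fixed prompt
pi = softmax(theta)
reward = np.array([1.0, 0.0, 0.0])   # made-up verifiable reward per action
b = np.array([0.9, 0.1, 0.3])        # made-up baseline that depends on a rollout's action

# Row a holds the score function: gradient of log pi(a) w.r.t. theta = onehot(a) - pi.
score = np.eye(3) - pi

# Exact policy gradient: sum_a pi(a) * r(a) * score[a].
true_grad = (pi * reward) @ score

# Baseline read from the SAME rollout whose gradient it scales.
own = (pi * (reward - b)) @ score

# Baseline read from an INDEPENDENT rollout a2 of the same prompt.
cross = sum(pi[a] * pi[a2] * (reward[a] - b[a2]) * score[a]
            for a in range(3) for a2 in range(3))

print("bias with own-rollout baseline:  ", own - true_grad)
print("bias with cross-rollout baseline:", cross - true_grad)
```

In this toy setting the baseline read from the same rollout shifts the expected gradient, while the one read from an independent rollout does not. The sketch holds the baseline function fixed, so it says nothing about the online probe updates raised in the second major comment.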

minor comments (2)

The abstract and summary claims reference specific performance parity and generalization results, but the manuscript should include explicit experimental details, ablation tables, and statistical significance tests for the compute savings and value-estimator comparisons to allow verification.
Notation for the probe inputs (hidden states + entropy) and the exact loss used for online training should be formalized with equations to clarify how the cross-rollout prediction is implemented during the forward pass.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments raise important points about the theoretical grounding of the cross-rollout construction and the empirical validation of the value probe. We address each major comment below with clarifications drawn from the manuscript and commit to revisions that strengthen the presentation without altering the core claims or results.

Referee: [cross-rollout construction] Cross-rollout construction (method description): the claim that predicting value for rollout i from rollout j's (j≠i) hidden states and entropy fully breaks trajectory-specific dependence is not obviously true. All rollouts for a given prompt share identical prompt tokens and therefore identical early-layer hidden states; any prompt-level signal captured by the probe can still correlate the baseline with the policy's own trajectory, violating the zero-covariance requirement for unbiased policy gradients. A formal argument or covariance measurement under this construction is required.

Authors: We appreciate the referee's careful scrutiny of the independence argument. Although prompt tokens (and thus early-layer hidden states) are identical across rollouts for a given prompt, the cross-rollout construction feeds the probe exclusively with the full set of hidden states and entropy statistics from an independent rollout j when computing the baseline for rollout i. Because j is sampled independently of i, the baseline random variable for i is statistically independent of the actions, log-probabilities, and reward of rollout i. This independence directly implies that the covariance between the baseline and the policy gradient term for i is zero, satisfying the condition for unbiased policy gradients. Prompt-level signals captured by the probe are fixed for the prompt and do not introduce dependence on i's specific trajectory. To make this argument fully rigorous, we will add a short formal proof of unbiasedness (showing E[baseline_i * ∇log π_i] = 0) together with empirical covariance measurements between the cross-rollout baselines and the corresponding advantages in the revised manuscript. revision: yes
Referee: [value estimator training] Value estimator training (online probe section): the probe is trained jointly with the policy on the same rollouts. Even with cross-rollout prediction, the online fitting on trajectory-conditioned features risks the probe learning spurious correlations rather than true expected reward, especially since no ablation isolating the probe's contribution or measuring value prediction error against ground-truth returns is referenced in the provided claims.

Authors: We agree that joint online training on trajectory-conditioned features warrants explicit safeguards and validation. The cross-rollout prediction already prevents the baseline used for a rollout from depending on that rollout's own features, reducing the risk of direct spurious correlation with the advantage. Nevertheless, the manuscript currently demonstrates the probe's utility through end-to-end performance matching DAPO and comparability to a separate LLM-scale value model, without dedicated ablations that isolate the probe or report its prediction error (e.g., MSE) against ground-truth returns. We will incorporate these analyses in the revision, including (i) value-prediction error curves against actual verifiable rewards, (ii) an ablation that replaces the probe with a constant or random baseline, and (iii) a comparison of probe accuracy when trained with versus without cross-rollout, to confirm that the estimator learns meaningful expected-reward signals rather than artifacts. revision: yes
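The independence argument in the first response can be written out in a few lines. The following is a sketch, not the authors' proof. It assumes that the two rollouts $y_i$ and $y_j$ are sampled independently from $\pi_\theta(\cdot \mid x)$ and that the probe parameters are held fixed while the baseline is computed; $b(x, y_j)$ is hypothetical notation for the probe output on rollout $j$.

```latex
\begin{align*}
\mathbb{E}_{y_i, y_j}\!\left[ b(x, y_j)\, \nabla_\theta \log \pi_\theta(y_i \mid x) \right]
  &= \mathbb{E}_{y_j}\!\left[ b(x, y_j) \right] \cdot
     \mathbb{E}_{y_i}\!\left[ \nabla_\theta \log \pi_\theta(y_i \mid x) \right] \\
  &= \mathbb{E}_{y_j}\!\left[ b(x, y_j) \right] \cdot
     \nabla_\theta \sum_{y_i} \pi_\theta(y_i \mid x) \\
  &= \mathbb{E}_{y_j}\!\left[ b(x, y_j) \right] \cdot \nabla_\theta 1 \;=\; 0 .
\end{align*}
```

The first equality is where independence of the two rollouts is used, and the prompt $x$ is conditioned on throughout. The sketch does not cover any dependence introduced when the probe is updated on the same rollouts it later baselines, which is the concern of the second major comment.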

Circularity Check

0 steps flagged

No significant circularity detected in POISE derivation

The paper's claimed derivation chain centers on introducing a lightweight probe trained online on hidden states and entropy, combined with an explicit cross-rollout construction to estimate values while aiming to preserve unbiased policy gradients. This construction is presented as a methodological safeguard (predicting rollout i's value from rollout j's independent internal states) rather than a quantity that reduces to its inputs by definition or a fitted parameter renamed as a prediction. No equations or steps in the provided text show the advantage estimate or gradient unbiasedness being equivalent to the probe's training data by construction. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are invoked to force the central result. The performance claims (matching DAPO with less compute) are framed as empirical outcomes on benchmarks, not tautological derivations. The method is self-contained against external benchmarks like DAPO and separate value models, with the cross-rollout serving as an independent design choice to address bias rather than a circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of an online-trained probe and the validity of the cross-rollout unbiasedness construction; limited information is available from the abstract alone.

free parameters (1)

lightweight probe weights
The probe is trained online alongside the policy, so its parameters are fitted during the RL process.
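To make "fitted during the RL process" concrete, the sketch below shows one online update of a probe on a single rollout. The loss, optimiser, and target are not given on this page, so a logistic probe, a plain SGD step on the cross-entropy loss, and the rollout's own reward as the regression target are all assumptions; `probe_update` is a hypothetical name.

```python
import numpy as np

def probe_update(w, phi, reward, lr=0.05):
    """One SGD step of an assumed logistic probe toward the observed reward.

    w      : (d,) current probe weights
    phi    : (d,) probe features of one rollout
    reward : verifiable reward of that rollout, in [0, 1]
    """
    pred = 1.0 / (1.0 + np.exp(-(phi @ w)))   # predicted expected reward
    grad = (pred - reward) * phi              # gradient of the cross-entropy loss w.r.t. w
    return w - lr * grad

# Toy example: starting from zero weights the probe predicts 0.5,
# and a rewarded rollout pulls the prediction for its features upward.
w0 = np.zeros(2)
phi = np.array([1.0, 2.0])
w1 = probe_update(w0, phi, reward=1.0, lr=0.1)
print(w1)
```

Because the weights change at every step, the probe's parameters count as fitted quantities rather than fixed inputs, which is why the ledger lists them as a free parameter.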

axioms (1)

domain assumption: Cross-rollout construction preserves unbiased policy gradients when using trajectory-conditioned internal states
Invoked to justify using one rollout's states to estimate another's value without introducing bias.

invented entities (1)

lightweight probe · no independent evidence
purpose: Predict expected verifiable reward from prompt/trajectory hidden states and token-entropy statistics
New component introduced to obtain the baseline at negligible extra cost.

pith-pipeline@v0.9.0 · 5576 in / 1433 out tokens · 133505 ms · 2026-05-12T03:56:50.835037+00:00 · methodology
