Recognition: 2 theorem links · Lean Theorem
StaRPO: Stability-Augmented Reinforcement Policy Optimization
Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3
The pith
StaRPO augments RL rewards for LLMs with autocorrelation and path efficiency to improve both accuracy and logical stability in reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StaRPO decomposes reasoning stability into two computable metrics: the autocorrelation function (ACF), which measures local step-to-step coherence, and path efficiency (PE), which measures the global goal-directedness of the trajectory. These metrics are combined with standard task rewards to supply complementary, process-aware signals during policy optimization, yielding models that make fewer logic errors while raising final-answer accuracy on reasoning benchmarks.
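The page does not reproduce the paper's exact metric definitions, so the following is a minimal sketch under stated assumptions: each reasoning step is mapped to an embedding vector, ACF is taken as the lag-k sample autocorrelation of the centered step sequence, and PE as net displacement divided by total path length. The function names and the embedding input are illustrative, not the authors' implementation.

```python
import numpy as np

def acf_reward(steps: np.ndarray, lag: int = 1) -> float:
    """Lag-k autocorrelation of a (num_steps, dim) sequence of step embeddings.
    Values near 1 suggest smooth step-to-step drift (local coherence);
    values near 0 suggest erratic jumps between consecutive steps."""
    x = steps - steps.mean(axis=0)    # center each embedding dimension
    num = (x[:-lag] * x[lag:]).sum()  # covariance with the lagged copy
    den = (x * x).sum()               # total variance across the trajectory
    return float(num / den) if den > 0 else 0.0

def pe_reward(steps: np.ndarray) -> float:
    """Path efficiency: straight-line distance from first to last step,
    divided by the total distance actually traveled. 1.0 is a direct path;
    values near 0 indicate wandering or redundant reasoning."""
    hops = np.linalg.norm(np.diff(steps, axis=0), axis=1)  # per-step movement
    net = np.linalg.norm(steps[-1] - steps[0])             # net displacement
    total = hops.sum()
    return float(net / total) if total > 0 else 0.0
```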
What carries the argument
The stability-augmented reward that integrates the autocorrelation function for local coherence and path efficiency for global directedness with task rewards during reinforcement policy optimization.
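The combination itself, as quoted from the paper further down this page, is a weighted sum in which $\lambda_{\text{acf}}$ and $\lambda_{\text{pe}}$ trade the two stability terms off against the task reward:

$$r_{\text{StaRPO}}(x, y_i) = r(x, y_i) + \lambda_{\text{acf}} \cdot r_{\text{acf}} + \lambda_{\text{pe}} \cdot r_{\text{pe}}$$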
If this is right
- The combined reward produces higher final-answer accuracy together with fewer logical inconsistencies on four reasoning benchmarks.
- ACF and PE rewards correlate with logic errors across two different backbone models, confirming their utility as stability signals.
- The framework delivers consistent gains over baselines that use only final-answer feedback.
- Logical stability improves without requiring new model architectures or extensive additional tuning.
Where Pith is reading between the lines
- The same lightweight stability proxies might transfer to other sequential generation tasks such as planning or code synthesis.
- The approach could be combined with existing process-supervision methods to produce even stronger consistency gains.
- If the correlation between these metrics and errors generalizes, it suggests a broader class of cheap trajectory-quality signals for RL in language models.
Load-bearing premise
The assumption that the autocorrelation function and path efficiency are valid lightweight proxies for logical stability, and that adding them to the reward reduces inconsistencies without creating new failure modes or demanding extensive hyperparameter tuning.
What would settle it
A new set of experiments on the same benchmarks where the ACF and PE rewards show no measurable correlation with logic errors or fail to improve logical stability metrics beyond standard RL baselines.
Original abstract
Reinforcement learning (RL) is effective in enhancing the accuracy of large language models in complex reasoning tasks. Existing RL policy optimization frameworks rely on final-answer correctness as feedback signals and rarely capture the internal logical structure of the reasoning process. Consequently, the models would generate fluent and semantically relevant responses but logically inconsistent, structurally erratic, or redundant. To this end, we propose StaRPO, a stability-augmented reinforcement learning framework that explicitly incorporates reasoning stability into the optimization objective. Our StaRPO decomposes stability into two computable lightweight metrics: the Autocorrelation Function (ACF) to evaluate local step-to-step coherence, and Path Efficiency (PE) to evaluate global goal-directedness of the reasoning trajectory. These stability rewards are combined with task rewards to provide complementary and process-aware feedback. We validate the effectiveness of using ACF and PE rewards by showing their correlation with logic errors on two backbone models. Experiments on four reasoning benchmarks show that StaRPO consistently outperforms compared baselines and can enhance both final-answer accuracy and logical stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes StaRPO, a stability-augmented RL policy optimization framework for LLMs on reasoning tasks. It decomposes reasoning stability into two lightweight metrics—Autocorrelation Function (ACF) for local step-to-step coherence and Path Efficiency (PE) for global goal-directedness—combines them with task rewards, validates the metrics via correlation with logic errors on two backbone models, and reports consistent outperformance over baselines on four reasoning benchmarks for both final-answer accuracy and logical stability.
Significance. If the central results hold, StaRPO would offer a practical, low-overhead extension to existing RL methods for reasoning that directly targets process-level consistency rather than final-answer correctness alone. This could meaningfully reduce logically inconsistent or redundant outputs in deployed LLM reasoners, with the metrics' claimed lightness making them attractive for scaling.
major comments (3)
- [Abstract] The validation of ACF and PE rests solely on reported correlation with logic errors on two backbones, yet no coefficients, p-values, annotation protocol for logic errors, or controls for confounding factors (e.g., length, fluency) are supplied. Without these, the claim that the metrics are reliable proxies for stability cannot be evaluated, and the subsequent assertion that their inclusion in the RL objective will improve both accuracy and stability lacks direct support (a sketch of the requested check follows this list).
- [Experiments, §4] No details are given on the exact reward combination formula (weights between ACF, PE, and task reward, normalization, or clipping), the precise definition of the reasoning steps or trajectories over which ACF/PE are computed, statistical significance tests for benchmark gains, or full baseline implementations (including whether baselines also received process rewards). These omissions make it impossible to verify that the reported outperformance is attributable to the stability augmentation rather than implementation differences or hyperparameter tuning.
- [Abstract, §3] The paper does not address whether maximizing ACF and PE can be achieved by superficially coherent but still erroneous trajectories (reward hacking), or whether the combined objective introduces trade-offs with final-answer accuracy. Correlation with errors on fixed models does not establish that gradient updates on the augmented reward will reduce those errors without new failure modes across scales or tasks.
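For concreteness, the kind of validation the first comment requests (a correlation coefficient with a p-value, plus a control for response length as a confound) could be sketched as below; acf_scores, logic_errors, and lengths are hypothetical per-trajectory arrays, not data from the paper:

```python
import numpy as np
from scipy import stats

def residualize(y: np.ndarray, confound: np.ndarray) -> np.ndarray:
    """Remove the linear effect of a confound (e.g., response length) from y."""
    X = np.column_stack([np.ones_like(confound), confound])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def validate_metric(scores, errors, lengths):
    """Raw and length-controlled Pearson correlation between a stability
    metric (e.g., ACF) and annotated logic-error counts."""
    r, p = stats.pearsonr(scores, errors)
    r_part, p_part = stats.pearsonr(residualize(scores, lengths),
                                    residualize(errors, lengths))
    return {"r": r, "p": p, "r_partial": r_part, "p_partial": p_part}
```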
minor comments (3)
- [§3] Clarify whether ACF is computed on token embeddings, hidden states, or discrete step representations, and provide the exact lag range used.
- [Experiments] Add a table or figure showing per-benchmark accuracy and stability deltas with error bars or multiple seeds to support the 'consistently outperforms' claim.
- [Introduction] Include a brief related-work paragraph contrasting StaRPO with prior process-reward or self-consistency methods.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive feedback on our manuscript. We appreciate the opportunity to clarify key aspects of StaRPO and strengthen the presentation of our results. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract] The validation of ACF and PE rests solely on reported correlation with logic errors on two backbones, yet no coefficients, p-values, annotation protocol for logic errors, or controls for confounding factors (e.g., length, fluency) are supplied. Without these, the claim that the metrics are reliable proxies for stability cannot be evaluated, and the subsequent assertion that their inclusion in the RL objective will improve both accuracy and stability lacks direct support.
Authors: We agree that the validation section requires more rigorous statistical reporting to support ACF and PE as reliable proxies. The manuscript currently describes the correlation with logic errors but omits coefficients, p-values, the annotation protocol, and controls for confounders. In the revised version, we will add these details (including controls for length and fluency) to the abstract and the relevant sections, providing stronger empirical grounding for the metrics' utility in the RL objective. Revision: yes.
Referee: [Experiments, §4] No details are given on the exact reward combination formula (weights between ACF, PE, and task reward, normalization, or clipping), the precise definition of the reasoning steps or trajectories over which ACF/PE are computed, statistical significance tests for benchmark gains, or full baseline implementations (including whether baselines also received process rewards). These omissions make it impossible to verify that the reported outperformance is attributable to the stability augmentation rather than implementation differences or hyperparameter tuning.
Authors: We acknowledge that the current Experiments section lacks the implementation details needed for full reproducibility and attribution of gains. We will expand the Experiments section and §4 to specify the exact reward combination formula (including weights, normalization, and clipping), precise definitions of the reasoning steps and trajectories used for ACF/PE, statistical significance tests for benchmark results, and complete baseline implementation details (clarifying process-reward usage). These additions will enable verification that the improvements arise from the stability augmentation. Revision: yes.
Referee: [Abstract, §3] The paper does not address whether maximizing ACF and PE can be achieved by superficially coherent but still erroneous trajectories (reward hacking), or whether the combined objective introduces trade-offs with final-answer accuracy. Correlation with errors on fixed models does not establish that gradient updates on the augmented reward will reduce those errors without new failure modes across scales or tasks.
Authors: We recognize this as a substantive limitation of the current analysis. While we demonstrate correlation on fixed models, the manuscript does not explicitly discuss reward-hacking risks or accuracy trade-offs under optimization. In the revision, we will add a discussion in §3 addressing these concerns, including any observed trade-offs from our experiments and notes on potential new failure modes. We will also flag this as a limitation and outline directions for future validation across scales. Revision: partial.
Circularity Check
No significant circularity: the metrics and claims remain independent of fitted inputs and self-citations.
Full rationale
The paper defines ACF (local step-to-step coherence) and PE (global goal-directedness) as standalone computable metrics, then reports their empirical correlation with logic errors on two backbone models as validation. This correlation is an external measurement, not a definitional reduction or fitted parameter renamed as prediction. The StaRPO objective simply adds these process rewards to task rewards without any equation or claim reducing by construction to the target accuracy or stability outcomes. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: ACF and PE are computable lightweight metrics that evaluate reasoning stability and correlate with logic errors.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Paper passage: "We quantify reasoning stability by decomposing it into two measurable metrics... Autocorrelation Function (ACF) ... Path Efficiency (PE) ... $r_{\text{StaRPO}}(x, y_i) = r(x, y_i) + \lambda_{\text{acf}} \cdot r_{\text{acf}} + \lambda_{\text{pe}} \cdot r_{\text{pe}}$"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Paper passage: "Experiments on four reasoning benchmarks show that StaRPO consistently outperforms compared baselines"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.
- CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
Reference graph
Works this paper leans on
- [1] Carl Anderson, Benjamin Vandenberg, Christopher Hauser, Alexander Johansson, and Nathaniel Galloway. Semantic coherence dynamics in large language models through layered syntax-aware memory retention mechanism. Authorea: New York, NY, USA, 2024.
- [2] Dibyanayan Bandyopadhyay, Soham Bhattacharjee, and Asif Ekbal. Thinking machines: A survey of LLM based reasoning strategies. arXiv:2503.10814, 2025.
- [3] Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, and Jiajun Zhang. ACE-RL: Adaptive constraint-enhanced reward for long-form generation reinforcement learning. arXiv:2509.04903, 2025.
- [4] Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, and Tianyi Lin. Exploration vs exploitation: Rethinking RLVR through clipping, entropy, and spurious reward. arXiv:2512.16912, 2025.
- [5] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv:2503.09567, 2025.
- [6] Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, and Hannah Marlowe. Breaking the safety-capability tradeoff: Reinforcement learning with verifiable rewards maintains safety guardrails in LLMs. arXiv:2511.21050, 2025.
- [7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv:2110.14168, 2021.
- [8] Anjie Fang, Craig Macdonald, Iadh Ounis, and Philip Habel. Using word embedding to evaluate the coherence of topics from Twitter data. In SIGIR, 2016.
- [9] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. arXiv:2505.10978, 2025.
- [10] Christian George. The endorsement of the premises: Assumption-based or belief-based reasoning. British Journal of Psychology, 86(1):93–111, 1995.
- [11] Trudy Govier et al. A Practical Study of Argument. Wadsworth: Belmont, CA, 2010.
- [12] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on LLM-as-a-judge. The Innovation, 2024.
- [13] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.
- [14] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv:2103.03874, 2021.
- [15] Amin Hosseiny Marani and Eric PS Baumer. A review of stability in topic modeling: Metrics for assessing and techniques for improving stability. 2023.
- [16] Zijian Huang, Meng-Fen Chiang, and Wang-Chien Lee. LinE: Logical query reasoning over hierarchical knowledge graphs. In SIGKDD, 2022.
- [17] Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv:2503.22342, 2025.
- [18] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv:2412.19437, 2024.
- [19] Keliang Liu, Dingkang Yang, Ziyun Qian, Weijie Yin, Yuchi Wang, Hongsheng Li, Jun Liu, Peng Zhai, Yang Liu, and Lihua Zhang. Reinforcement learning meets large language models: A survey of advancements and applications across the LLM lifecycle. arXiv:2509.16679, 2025.
- [20] Francois Luus, Prithviraj Sen, Pavan Kapanipathi, Ryan Riegel, Ndivhuwo Makondo, Thabang Lebese, and Alexander Gray. Logic embeddings for complex query answering. arXiv:2103.00418, 2021.
- [21] Math-AI. AIME 2024. Accessed: 2025-10-06.
- [22] Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, and Jian-Yun Nie. OpenDecoder: Open large language model decoding to incorporate document quality in RAG. arXiv:2601.09028, 2026.
- [23] Chau DM Nguyen, Tim French, Michael Stewart, Melinda Hodkiewicz, and Wei Liu. Representation learning in complex logical query answering on knowledge graphs: A survey. ACM Computing Surveys.
- OpenAI. GPT-4o-mini: Advancing Cost-Efficient Intelligence, 2024. Accessed: 2024-10-05.
- [24] Rajvardhan Patil, Sorio Boit, Venkat Gudivada, and Jagadeesh Nandigam. A survey of text representation and embedding techniques in NLP. IEEE Access, 11:36120–36146, 2023.
- [25] Federico Pennino, Bianca Raimondi, Massimo Rondelli, Andrea Gurioli, and Maurizio Gabbrielli. From reasoning to code: GRPO optimization for underrepresented languages. arXiv:2506.11027, 2025.
- [26] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- [27] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- [28] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300, 2024.
- [29] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In ICCV, 2023.
- [30] Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. ACM Computing Surveys, 2023.
- [31] Zhenwei Tang, Shichao Pei, Xi Peng, Fuzhen Zhuang, Xiangliang Zhang, and Robert Hoehndorf. Neural multi-hop logical query answering with concept-level answers. In International Semantic Web Conference, 2023.
- [32] Qwen Team. Qwen2 technical report. arXiv:2407.10671, 2024.
- [33]
- [34] Junda Wu, Yuxin Xiong, Xintong Li, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, and Julian McAuley. CTRLS: Chain-of-thought reasoning via latent state-transition. arXiv:2507.08182, 2025.
- [35] Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, and Yueting Zhuang. LAPO: Internalizing reasoning efficiency via length-adaptive policy optimization. arXiv:2507.15758, 2025.
- [36] Jingyuan Yang, Dapeng Chen, Yajing Sun, Rongjun Li, Zhiyong Feng, and Wei Peng. Enhancing semantic consistency of large language models through model editing: An interpretability-oriented approach. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3343–3353, 2024.
- [37] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023.
- [38] Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, et al. RLPR: Extrapolating RLVR to general domains without verifiers. arXiv:2506.18254, 2025.
- [39] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv:2504.13837, 2025.
- [40] Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. LLM as a mastermind: A survey of strategic reasoning with large language models. arXiv:2404.01230, 2024.
- [41] Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv:2512.07783, 2025.
- [42] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement learning for LLMs: A survey. arXiv:2509.02547, 2025.
- [43] Jinghan Zhang, Xiting Wang, Fengran Mo, Yeyang Zhou, Wanfu Gao, and Kunpeng Liu. Entropy-based exploration conduction for multi-step reasoning. arXiv:2503.15848, 2025.
- [44] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv:2509.08827, 2025.
- [45] Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. EDGE-GRPO: Entropy-driven GRPO with guided error correction for advantage diversity. arXiv:2507.21848, 2025.
- [46] Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-Reward: Training multimodal reward model through stable reinforcement learning. arXiv:2505.02835, 2025.
- [47] Jinghan Zhang, Fengran Mo, Tharindu Cyril Weerasooriya, Xinyue Ye, Dongjie Wang, Yanjie Fu, and Kunpeng Liu. Blind spot navigation in large language model reasoning with thought space explorer. In Vera Demberg, Kentaro Inui, and Lluís Màrquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, pages 3691–…. Association for Computational Linguistics, 2026.
- [48] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv:2507.18071, 2025.