Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Kun Zhou; Nikki Lijing Kuang; Qiyue Gao; Tongtong Liang; Xunpeng Huang; Yi-An Ma; Yingyu Lin; Yuxiong He; Zhewei Yao

arxiv: 2606.27369 · v1 · pith:A325CT6Lnew · submitted 2026-06-25 · 💻 cs.LG

Pith reviewed 2026-06-26 04:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · large language models · reward shaping · coding benchmarks · generalization · heuristic optimization

The pith

Calibrated ranking rewards let LLMs improve on exact coding benchmarks after training only on score-based tasks without ground-truth answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning on score-based optimization tasks, with no ground-truth solutions, can still improve LLMs on exact-solution coding problems when the rewards are properly calibrated. It identifies two issues with continuous execution scores under group-relative RL: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions outweigh rare but stronger candidates. It addresses both with instance-wise comparisons that emphasize top-ranked solutions while keeping bounded feedback for the others. A sympathetic reader would care because this expands usable training environments for LLMs to problems where exact answers are unavailable or unverifiable.
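The two failure modes can be made concrete with toy numbers. The sketch below is an illustration only, not the paper's setup: the scores, the solution names `A` and `B`, and the GRPO-style mean/std normalization in `group_relative_advantages` are all assumptions chosen to show the effect.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """GRPO-style advantages (assumed form): centre on the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# --- Scale dominance (toy numbers, higher score is better) ---
# Instance 0 has scores around 1e1-1e2, instance 1 around 1e6.
raw_scores = {
    "A": [90.0, 1_000_000.0],  # far better on instance 0
    "B": [10.0, 1_010_000.0],  # 1% better on instance 1
}
# Summing raw scores lets the large-magnitude instance decide the winner.
summed = {name: sum(scores) for name, scores in raw_scores.items()}
# Comparing instance by instance treats both instances equally.
wins_a = sum(a > b for a, b in zip(raw_scores["A"], raw_scores["B"]))
wins_b = sum(b > a for a, b in zip(raw_scores["A"], raw_scores["B"]))

# --- Frequency dominance (toy numbers) ---
# Six copies of a mediocre solution, one strong solution, one failed run.
group = [0.8] * 6 + [1.0] + [0.0]
advantages = group_relative_advantages(group)
# Total positive push on the repeated mediocre solution vs. the rare best one.
mass_mediocre = sum(a for r, a in zip(group, advantages) if r == 0.8)
mass_best = sum(a for r, a in zip(group, advantages) if r == 1.0)

print(summed, wins_a, wins_b)
print(round(mass_mediocre, 3), round(mass_best, 3))
```

In this toy group the raw sum prefers `B` even though each solution wins one instance, and the six repeated mediocre samples collectively receive more positive advantage than the single best sample.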

Core claim

The RiVER framework trains LLMs on 12 AtCoder Heuristic Contest tasks using deterministic execution feedback as continuous rewards. Calibrated reward shaping performs instance-wise comparisons and prioritizes top-ranked solvers to counteract scale and frequency dominance. This produces 8.9 percent and 9.4 percent gains in ALE rating rank for Qwen3-8B and GLM-Z1-9B, plus absolute average improvements of 2.4 percent on LiveCodeBench and 3.5 percent on USACO, whereas raw-score baselines improve only the heuristic tasks and fail to transfer.

What carries the argument

calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions
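A minimal sketch of what such shaping could look like follows. The abstract does not give RiVER's actual formula, so everything here is hypothetical: the function name `rank_shaped_rewards`, the parameters `top_reward` and `cap`, the choice of zero credit for failed runs, and the "fraction of valid solutions beaten" rule for bounded feedback.

```python
def rank_shaped_rewards(scores, top_reward=1.0, cap=0.3):
    """Hypothetical instance-wise rank shaping (not the paper's formula).

    scores[i][j] is the score of sampled solution i on test instance j,
    higher is better, or None if the run was invalid on that instance.
    """
    n_solutions = len(scores)
    n_instances = len(scores[0])
    totals = [0.0] * n_solutions
    for j in range(n_instances):
        column = [row[j] for row in scores]
        valid = [s for s in column if s is not None]
        if not valid:
            continue  # nobody produced a valid output on this instance
        best = max(valid)
        for i, s in enumerate(column):
            if s is None:
                continue  # invalid run: no credit on this instance
            if s == best:
                totals[i] += top_reward  # emphasize the top-ranked solution
            else:
                # Bounded feedback: share of valid solutions strictly beaten, capped.
                beaten = sum(1 for other in valid if other < s)
                totals[i] += cap * beaten / len(valid)
    # Average over instances so every instance carries equal weight.
    return [t / n_instances for t in totals]


toy_scores = [
    [90.0, 1_000_000.0],
    [10.0, 1_010_000.0],
    [None, 5.0],
]
rewards = rank_shaped_rewards(toy_scores)
print(rewards)
```

Because only within-instance order matters in this sketch, multiplying all scores of one instance by a positive constant leaves the rewards unchanged, which is one plausible way to remove scale dominance. Whether the paper handles ties, duplicates, or invalid runs this way is not stated in the abstract.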

If this is right

Qwen3-8B and GLM-Z1-9B advance 8.9 percent and 9.4 percent in ALE rating rank after training on the heuristic tasks.
The same training yields 2.4 percent average absolute improvement on LiveCodeBench and 3.5 percent on USACO despite no ground-truth answers in training.
Baselines trained with raw execution scores improve ALE rating but show no transfer to exact-solution benchmarks.
Score-based optimization tasks can therefore serve as effective training environments for general coding ability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Score-based tasks with calibrated rewards may act as scalable proxies for developing broader problem-solving ability in LLMs.
The calibration technique could extend to other continuous-reward RL settings where magnitude and repetition distort updates.
Mixed training that combines score-based and exact tasks might produce further gains on both types of benchmark.

Load-bearing premise

The calibrated reward shaping produces policy updates that generalize from the score-based training distribution to exact-solution tasks.

What would settle it

Training identical models with raw uncalibrated execution scores and checking whether transfer gains on LiveCodeBench and USACO remain at comparable levels.

Figures

Figures reproduced from arXiv: 2606.27369 by Kun Zhou, Nikki Lijing Kuang, Qiyue Gao, Tongtong Liang, Xunpeng Huang, Yi-An Ma, Yingyu Lin, Yuxiong He, Zhewei Yao.

**Figure 1.** Comparison between traditional answer-matching verification and our group-wise rank-induced verifiable

**Figure 2.** Best-so-far performance across 12 AHC problems.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) for training LLMs typically relies on ground-truth answers to assign rewards, limiting its applicability to tasks where the ground-truth solution is unknown. We introduce a Ranking-induced VERifiable framework (RiVER) that trains LLMs on score-based optimization tasks without ground-truth solutions, using deterministic execution feedback as continuous-valued supervision. When applying group-relative RL to such continuous rewards, we identify two key challenges: scale dominance, where uncalibrated score magnitudes across test instances distort policy updates, and frequency dominance, where repeatedly sampled suboptimal solutions can outweigh rare but stronger candidates. RiVER addresses these challenges with calibrated reward shaping that uses instance-wise comparisons and emphasizes top-ranked solvers while retaining bounded feedback for other valid solutions. We train on 12 AtCoder Heuristic Contest tasks and evaluate on Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. RiVER advances Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4% in ALE rating rank. More importantly, despite training exclusively on score-based tasks without any ground-truth solutions, RiVER also improves the backbones across exact-solution benchmarks such as LiveCodeBench and USACO by an absolute average improvement of 2.4% and 3.5%. By contrast, baselines trained with raw execution scores improve ALE rating but fail to transfer to exact-solution benchmarks. These results suggest that score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims calibrated ranking rewards on score-only AtCoder tasks can transfer to exact benchmarks like LiveCodeBench, but the abstract gives no ablations or variance to show the calibration is what causes it.

The one thing to know is that they train on 12 AtCoder heuristic tasks using only execution scores, apply a new reward calibration called RiVER, and report that the models then improve on LiveCodeBench and USACO by 2.4% and 3.5% absolute even though those evals use exact pass/fail. Raw-score training improves the training metric but does not transfer.

What they actually introduce is a fix for two problems in group-relative RL with continuous rewards: scale dominance across instances and frequency dominance from repeated weak samples. RiVER does instance-wise comparisons, boosts the top-ranked solutions, and keeps bounded feedback for the rest. They show this on Qwen3-8B and GLM-Z1-9B, with larger gains on the ALE rating than the baselines.

The soft spot is exactly what the stress-test note flags. The abstract gives only point estimates with no seed variance, no component ablations, and no check on whether the policy actually solves more exact cases or just rides correlations that happen to exist between the training tasks and the test suites. Without those, the transfer could come from extra RL compute, incidental task overlap, or the base models already having the capability. The paper does not isolate the ranking calibration as the load-bearing piece.

This is for groups working on reward design for LLM coding agents who need to use tasks without reliable ground truth. A reader who wants to try continuous-score RL would find the problem statement and the proposed shaping useful to test.

I would send it to peer review. The direction is worth a proper experimental check even if the current evidence for the causal claim is still thin.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RiVER, a ranking-induced verifiable RL framework that trains LLMs on score-based optimization tasks without ground-truth solutions by using deterministic execution feedback as continuous rewards. It identifies scale dominance and frequency dominance issues when applying group-relative RL to such rewards and proposes calibrated reward shaping via instance-wise comparisons that emphasize top-ranked solutions while bounding feedback for others. Training occurs on 12 AtCoder Heuristic Contest tasks; evaluation on ALE-Bench shows rating gains of 8.9% and 9.4% for Qwen3-8B and GLM-Z1-9B, with additional absolute gains of 2.4% and 3.5% reported on the exact-solution benchmarks LiveCodeBench and USACO. Raw-score baselines improve ALE rating but do not transfer.

Significance. If the transfer results hold after rigorous controls, the work would be significant because it provides evidence that score-based tasks with calibrated continuous rewards can improve general coding performance in LLMs without requiring ground-truth verifiers. This would expand the applicability of RLVR methods beyond closed-ended tasks with known solutions and offer a practical route to leverage abundant heuristic optimization environments for capability improvement.

major comments (3)

[Abstract] The central transfer claim (absolute gains of 2.4% on LiveCodeBench and 3.5% on USACO) is presented as point estimates with no mention of run count, seed-wise variance, or statistical tests; this is load-bearing because the abstract contrasts these gains against raw-score baselines that fail to transfer, yet the reliability of the difference cannot be assessed without those details.
[Description of RiVER and experimental outcomes] No ablation or component analysis is referenced that isolates the causal contribution of instance-wise comparisons plus top-rank emphasis (versus other training factors) to the observed generalization from score-based AtCoder tasks to exact-match benchmarks; without such controls the transfer could be explained by incidental task overlap or base-model capability rather than the calibrated shaping.
[Results on exact-solution benchmarks] It is not stated whether the reported improvements reflect an increase in the number of instances that receive exact solutions or merely higher partial scores that happen to correlate with the test distribution; this distinction is required to substantiate the claim that policy updates generalize beyond score quirks.
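The first major comment asks for seed-wise variance and a statistical test. As a rough sketch of what that reporting could involve, the snippet below computes the mean gain, its standard deviation, and a paired t statistic across seeds. The accuracy values are placeholders invented for illustration and are not results from the paper; the helper name `paired_t_statistic` is likewise hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(treated, control):
    """Paired t statistic for matched runs (same seed, two training recipes)."""
    if len(treated) != len(control) or len(treated) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [t - c for t, c in zip(treated, control)]
    spread = stdev(diffs)  # sample standard deviation of the per-seed differences
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# PLACEHOLDER per-seed benchmark accuracies (percent); NOT taken from the paper.
placeholder_calibrated = [31.0, 32.5, 30.8]
placeholder_raw = [28.9, 29.7, 28.6]

gains = [c - r for c, r in zip(placeholder_calibrated, placeholder_raw)]
gain_mean = mean(gains)
gain_std = stdev(gains)
t_stat = paired_t_statistic(placeholder_calibrated, placeholder_raw)

# With 3 seeds there are 2 degrees of freedom; the two-sided 5% critical
# value of Student's t is about 4.30, so |t| must exceed that to reject.
print(f"gain = {gain_mean:.2f} +/- {gain_std:.2f}, t = {t_stat:.2f}")
```

With so few seeds the critical value is large, so a point estimate of a few percentage points can only be judged once the per-seed spread is known.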

minor comments (2)

[Abstract] The abstract lists two base models (Qwen3-8B, GLM-Z1-9B-0414) but does not indicate whether the same training hyperparameters and data splits were used for both, which would clarify the generality of the reported gains.
Table or figure captions for the ALE-Bench, LiveCodeBench, and USACO results should explicitly note the evaluation protocol (e.g., number of samples per problem, temperature) to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive comments on our work. We provide point-by-point responses to the major comments below and commit to revisions where appropriate to address the concerns raised.

Referee: [Abstract] The central transfer claim (absolute gains of 2.4% on LiveCodeBench and 3.5% on USACO) is presented as point estimates with no mention of run count, seed-wise variance, or statistical tests; this is load-bearing because the abstract contrasts these gains against raw-score baselines that fail to transfer, yet the reliability of the difference cannot be assessed without those details.

Authors: We agree with this observation. The current abstract reports point estimates without detailing the experimental setup for variance. In the revised manuscript, we will update the abstract to include the number of runs (3 independent seeds), report mean and standard deviation for the gains, and mention that the improvements are statistically significant based on paired t-tests or similar. We will also add a dedicated section on experimental details for reproducibility.

Revision: yes

Referee: [Description of RiVER and experimental outcomes] No ablation or component analysis is referenced that isolates the causal contribution of instance-wise comparisons plus top-rank emphasis (versus other training factors) to the observed generalization from score-based AtCoder tasks to exact-match benchmarks; without such controls the transfer could be explained by incidental task overlap or base-model capability rather than the calibrated shaping.

Authors: This is a fair criticism. While the manuscript describes the components of RiVER, it does not present ablations to isolate their effects on transfer. We will conduct and include additional ablation experiments in the revision, comparing the full RiVER against versions without instance-wise calibration and without top-rank emphasis, to quantify their contribution to the generalization observed on LiveCodeBench and USACO.

Revision: yes

Referee: [Results on exact-solution benchmarks] It is not stated whether the reported improvements reflect an increase in the number of instances that receive exact solutions or merely higher partial scores that happen to correlate with the test distribution; this distinction is required to substantiate the claim that policy updates generalize beyond score quirks.

Authors: We thank the referee for highlighting this important distinction. The manuscript reports absolute improvements on exact-solution benchmarks but does not break down whether these stem from more exact matches or better partial scores. In the revision, we will add analysis showing the number of problems solved exactly before and after training, as well as average scores, to clarify that the gains include increases in exact solution rates.

Revision: yes
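The breakdown discussed in the last exchange can be sketched as follows. This is an illustrative helper under assumed inputs: each problem is represented as a hypothetical `(tests_passed, tests_total)` pair, and the three result lists are toy data, not measurements from the paper.

```python
def summarize(results):
    """Return (number of problems solved exactly, mean fraction of tests passed)."""
    exact = sum(1 for passed, total in results if passed == total)
    partial = sum(passed / total for passed, total in results) / len(results)
    return exact, partial


# Toy per-problem outcomes as (tests_passed, tests_total); illustrative only.
before = [(10, 10), (4, 10), (0, 10), (7, 10)]
after_partial_only = [(10, 10), (8, 10), (3, 10), (9, 10)]  # scores rise, no new full solves
after_more_exact = [(10, 10), (10, 10), (0, 10), (7, 10)]   # one additional full solve

for name, results in [
    ("before", before),
    ("after (partial only)", after_partial_only),
    ("after (more exact)", after_more_exact),
]:
    exact, partial = summarize(results)
    print(f"{name}: exact={exact}, mean partial={partial:.3f}")
```

In the toy data the "partial only" outcome has the highest average score but no additional exact solves, which is why the two quantities need to be reported separately.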

Circularity Check

0 steps flagged

No circularity; purely empirical results with no derivations or self-referential claims

The paper reports experimental outcomes from RL training on AtCoder heuristic tasks using a proposed calibrated reward shaping method (RiVER) and measures improvements on ALE-Bench, LiveCodeBench, and USACO. No equations, first-principles derivations, or predictions appear; the transfer to exact-solution benchmarks is presented as an observed empirical result rather than a mathematically derived claim. The abstract explicitly contrasts raw-score baselines (which improve ALE but fail to transfer) against RiVER, but this is a factual reporting of training runs, not a reduction of any result to its own inputs by construction. No self-citations, ansatzes, or fitted parameters renamed as predictions are invoked in the load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5867 in / 1114 out tokens · 48151 ms · 2026-06-26T04:44:37.876063+00:00 · methodology

Reference graph

Works this paper leans on

71 extracted references · 1 canonical work pages

[1]

2025 , url =

Miles: Enterprise-Grade Reinforcement Learning for Large-Scale Model Post-Training , author =. 2025 , url =

2025
[2]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[3]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv
[4]

2024 , eprint=

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. 2024 , eprint=

2024
[5]

Hugging Face repository , howpublished =

CodeForces , author=. Hugging Face repository , howpublished =. 2025 , publisher =

2025
[6]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li and David Choi and Junyoung Chung and Nate Kushman and Julian Schrittwieser and R. Competition-level code generation with AlphaCode , journal =. 2022 , doi =. https://www.science.org/doi/pdf/10.1126/science.abq1158 , abstract =

work page doi:10.1126/science.abq1158 2022
[7]

2024 , eprint=

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools , author=. 2024 , eprint=

2024
[8]

arXiv preprint arXiv:2506.20512 , year=

Octothinker: Mid-training incentivizes reinforcement learning scaling , author=. arXiv preprint arXiv:2506.20512 , year=

arXiv
[9]

arXiv preprint arXiv:2512.07783 , year=

On the interplay of pre-training, mid-training, and rl on reasoning language models , author=. arXiv preprint arXiv:2512.07783 , year=

arXiv
[10]

arXiv preprint arXiv:2502.21321 , year=

Llm post-training: A deep dive into reasoning large language models , author=. arXiv preprint arXiv:2502.21321 , year=

arXiv
[11]

arXiv preprint arXiv:2506.01939 , year=

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning , author=. arXiv preprint arXiv:2506.01939 , year=

Pith/arXiv arXiv
[12]

arXiv preprint arXiv:2504.13837 , year=

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? , author=. arXiv preprint arXiv:2504.13837 , year=

Pith/arXiv arXiv
[13]

arXiv preprint arXiv:2505.22617 , year=

The entropy mechanism of reinforcement learning for reasoning language models , author=. arXiv preprint arXiv:2505.22617 , year=

Pith/arXiv arXiv
[14]

arXiv preprint arXiv:2506.14758 , year=

Reasoning with exploration: An entropy perspective , author=. arXiv preprint arXiv:2506.14758 , year=

Pith/arXiv arXiv
[15]

arXiv preprint arXiv:2411.15124 , year=

Tulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

Pith/arXiv arXiv
[16]

arXiv preprint arXiv:2402.03300 , year=

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

Pith/arXiv arXiv
[17]

arXiv preprint arXiv:2503.14476 , year=

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

Pith/arXiv arXiv
[18]

arXiv preprint arXiv:2510.13554 , year=

Attention illuminates llm reasoning: The preplan-and-anchor rhythm enables fine-grained policy optimization , author=. arXiv preprint arXiv:2510.13554 , year=

Pith/arXiv arXiv
[19]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Math-shepherd: Verify and reinforce llms step-by-step without human annotations , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[20]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=
[21]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Learning planning-based reasoning by trajectories collection and process reward synthesizing , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[22]

arXiv preprint arXiv:2502.14276 , year=

Steca: Step-level trajectory calibration for llm agent learning , author=. arXiv preprint arXiv:2502.14276 , year=

arXiv
[23]

arXiv preprint arXiv:2505.10978 , year=

Group-in-group policy optimization for llm agent training , author=. arXiv preprint arXiv:2505.10978 , year=

Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2406.11176 , year=

Watch every step! llm agent learning via iterative step-level process refinement , author=. arXiv preprint arXiv:2406.11176 , year=

arXiv
[25]

arXiv preprint arXiv:2501.12948 , year=

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv
[26]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Opencoder: The open cookbook for top-tier code large language models , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[27]

arXiv preprint arXiv:2505.21297 , year=

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset , author=. arXiv preprint arXiv:2505.21297 , year=

arXiv
[28]

5-coder technical report , author=

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

Pith/arXiv arXiv
[29]

arXiv preprint arXiv:2505.22648 , year=

Webdancer: Towards autonomous information seeking agency , author=. arXiv preprint arXiv:2505.22648 , year=

arXiv
[30]

arXiv preprint arXiv:2507.02592 , year=

WebSailor: Navigating Super-human Reasoning for Web Agent , author=. arXiv preprint arXiv:2507.02592 , year=

Pith/arXiv arXiv
[31]

arXiv preprint arXiv:2510.16476 , year=

NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems , author=. arXiv preprint arXiv:2510.16476 , year=

arXiv
[32]

arXiv preprint arXiv:2509.16865 , year=

Large language models as end-to-end combinatorial optimization solvers , author=. arXiv preprint arXiv:2509.16865 , year=

arXiv
[33]

arXiv preprint arXiv:2504.11239 , year=

Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs , author=. arXiv preprint arXiv:2504.11239 , year=

arXiv
[34]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[35]

arXiv preprint arXiv:2410.13213 , year=

LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch , author=. arXiv preprint arXiv:2410.13213 , year=

arXiv
[36]

arXiv preprint arXiv:2506.15196 , year=

HeurAgenix: Leveraging LLMs for Solving Complex Combinatorial Optimization Challenges , author=. arXiv preprint arXiv:2506.15196 , year=

arXiv
[37]

arXiv preprint arXiv:2506.07972 , year=

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization , author=. arXiv preprint arXiv:2506.07972 , year=

arXiv
[38]

Advances in neural information processing systems , volume=

Reevo: Large language models as hyper-heuristics with reflective evolution , author=. Advances in neural information processing systems , volume=
[39]

arXiv preprint arXiv:2506.11057 , year=

STRCMP: Integrating Graph Structural Priors with Language Models for Combinatorial Optimization , author=. arXiv preprint arXiv:2506.11057 , year=

arXiv
[40]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Efficient heuristics generation for solving combinatorial optimization problems using large language models , author=. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2 , pages=
[41]

arXiv preprint arXiv:2508.20373 , year=

Graph-r1: Unleashing llm reasoning with np-hard graph problems , author=. arXiv preprint arXiv:2508.20373 , year=

arXiv
[42]

arXiv preprint arXiv:2505.03335 , year=

Absolute zero: Reinforced self-play reasoning with zero data , author=. arXiv preprint arXiv:2505.03335 , year=

Pith/arXiv arXiv
[43]

arXiv preprint arXiv:2506.03136 , year=

Co-evolving llm coder and unit tester via reinforcement learning , author=. arXiv preprint arXiv:2506.03136 , year=

arXiv
[44]

arXiv preprint arXiv:2505.20347 , year=

SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data , author=. arXiv preprint arXiv:2505.20347 , year=

arXiv
[45]

arXiv preprint arXiv:2509.23863 , year=

Spell: Self-play reinforcement learning for evolving long-context language models , author=. arXiv preprint arXiv:2509.23863 , year=

arXiv
[46]

arXiv preprint arXiv:2508.00410 , year=

Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models , author=. arXiv preprint arXiv:2508.00410 , year=

arXiv
[47]

arXiv preprint arXiv:2506.08745 , year=

Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning , author=. arXiv preprint arXiv:2506.08745 , year=

arXiv
[48]

arXiv preprint arXiv:2510.18821 , year=

Search self-play: Pushing the frontier of agent capability without supervision , author=. arXiv preprint arXiv:2510.18821 , year=

Pith/arXiv arXiv
[49]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
[50]

arXiv preprint arXiv:2410.11287 , year=

Process reward model with q-value rankings , author=. arXiv preprint arXiv:2410.11287 , year=

arXiv
[51]

arXiv preprint arXiv:2410.08146 , year=

Rewarding progress: Scaling automated process verifiers for llm reasoning , author=. arXiv preprint arXiv:2410.08146 , year=

Pith/arXiv arXiv
[52]

arXiv preprint arXiv:2501.07301 , year=

The lessons of developing process reward models in mathematical reasoning , author=. arXiv preprint arXiv:2501.07301 , year=

Pith/arXiv arXiv
[53]

arXiv preprint arXiv:2503.21295 , year=

R-prm: Reasoning-driven process reward modeling , author=. arXiv preprint arXiv:2503.21295 , year=

arXiv
[54]

arXiv preprint arXiv:2412.11006 , year=

Entropy-regularized process reward model , author=. arXiv preprint arXiv:2412.11006 , year=

arXiv
[55]

arXiv preprint arXiv:2505.23433 , year=

Diversity-Aware Policy Optimization for Large Language Model Reasoning , author=. arXiv preprint arXiv:2505.23433 , year=

arXiv
[56]

SIAM Review , volume=

Problem 66-11, Moving furniture through a hallway , author=. SIAM Review , volume=. 1966 , publisher=

1966
[57]

Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics , volume=

On measures of entropy and information , author=. Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1: contributions to the theory of statistics , volume=. 1961 , organization=

1961
[58]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

Pith/arXiv arXiv
[59]

arXiv preprint arXiv:2403.07974 , year=

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. arXiv preprint arXiv:2403.07974 , year=

Pith/arXiv arXiv
[60]

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Junfeng Sun , year=

American invitational mathematics examination (aime) 2025 , author=. Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Junfeng Sun , year=

2025
[61]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[62]

Advances in neural information processing systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=
[63]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[64]

arXiv preprint arXiv:1412.6980 , year=

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

Pith/arXiv arXiv
[65]

Science , volume=

Competition-level code generation with alphacode , author=. Science , volume=. 2022 , publisher=

2022
[66]

arXiv preprint arXiv:2511.07317 , year=

Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments , author=. arXiv preprint arXiv:2511.07317 , year=

Pith/arXiv arXiv
[67]

arXiv preprint arXiv:2506.09050 , year=

Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering , author=. arXiv preprint arXiv:2506.09050 , year=

arXiv
[68]

2024 , eprint=

Can Language Models Solve Olympiad Programming? , author=. 2024 , eprint=

2024
[69]

arXiv preprint arXiv:2509.24261 , year=

Risk-sensitive rl for alleviating exploration dilemmas in large language models , author=. arXiv preprint arXiv:2509.24261 , year=

arXiv
[70]

arXiv preprint arXiv:2601.16175 , year=

Learning to discover at test time , author=. arXiv preprint arXiv:2601.16175 , year=

Pith/arXiv arXiv
[71]

arXiv preprint arXiv:1711.05101 , year=

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

Pith/arXiv arXiv

[1] [1]

2025 , url =

Miles: Enterprise-Grade Reinforcement Learning for Large-Scale Model Post-Training , author =. 2025 , url =

2025

[2] [2]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[3] [3]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv

[4] [4]

2024 , eprint=

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. 2024 , eprint=

2024

[5] [5]

Hugging Face repository , howpublished =

CodeForces , author=. Hugging Face repository , howpublished =. 2025 , publisher =

2025

[6] [6]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li and David Choi and Junyoung Chung and Nate Kushman and Julian Schrittwieser and R. Competition-level code generation with AlphaCode , journal =. 2022 , doi =. https://www.science.org/doi/pdf/10.1126/science.abq1158 , abstract =

work page doi:10.1126/science.abq1158 2022

[7] [7]

2024 , eprint=

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools , author=. 2024 , eprint=

2024

[8] [8]

arXiv preprint arXiv:2506.20512 , year=

Octothinker: Mid-training incentivizes reinforcement learning scaling , author=. arXiv preprint arXiv:2506.20512 , year=

arXiv

[9] [9]

arXiv preprint arXiv:2512.07783 , year=

On the interplay of pre-training, mid-training, and rl on reasoning language models , author=. arXiv preprint arXiv:2512.07783 , year=

arXiv

[10] [10]

arXiv preprint arXiv:2502.21321 , year=

Llm post-training: A deep dive into reasoning large language models , author=. arXiv preprint arXiv:2502.21321 , year=

arXiv

[11] [11]

arXiv preprint arXiv:2506.01939 , year=

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning , author=. arXiv preprint arXiv:2506.01939 , year=

Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2504.13837 , year=

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? , author=. arXiv preprint arXiv:2504.13837 , year=

Pith/arXiv arXiv

[13] The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.

[14] Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758.

[15] Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.

[16] DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[17] DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[18] Attention illuminates LLM reasoning: The preplan-and-anchor rhythm enables fine-grained policy optimization. arXiv preprint arXiv:2510.13554.

[19] Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[20] Let's verify step by step. The Twelfth International Conference on Learning Representations.

[21] Learning planning-based reasoning by trajectories collection and process reward synthesizing. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.

[22] STeCa: Step-level trajectory calibration for LLM agent learning. arXiv preprint arXiv:2502.14276.

[23] Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978.

[24] Watch every step! LLM agent learning via iterative step-level process refinement. arXiv preprint arXiv:2406.11176.

[25] DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[26] OpenCoder: The open cookbook for top-tier code large language models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[27] rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset. arXiv preprint arXiv:2505.21297.

[28] Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.

[29] WebDancer: Towards autonomous information seeking agency. arXiv preprint arXiv:2505.22648.

[30] WebSailor: Navigating Super-human Reasoning for Web Agent. arXiv preprint arXiv:2507.02592.

[31] NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems. arXiv preprint arXiv:2510.16476.

[32] Large language models as end-to-end combinatorial optimization solvers. arXiv preprint arXiv:2509.16865.

[33] Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs. arXiv preprint arXiv:2504.11239.

[34] NPHardEval: Dynamic benchmark on reasoning ability of large language models via complexity classes. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[35] LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch. arXiv preprint arXiv:2410.13213.

[36] HeurAgenix: Leveraging LLMs for Solving Complex Combinatorial Optimization Challenges. arXiv preprint arXiv:2506.15196.

[37] HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization. arXiv preprint arXiv:2506.07972.

[38] ReEvo: Large language models as hyper-heuristics with reflective evolution. Advances in Neural Information Processing Systems.

[39] STRCMP: Integrating Graph Structural Priors with Language Models for Combinatorial Optimization. arXiv preprint arXiv:2506.11057.

[40] Efficient heuristics generation for solving combinatorial optimization problems using large language models. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2.

[41] Graph-R1: Unleashing LLM reasoning with NP-hard graph problems. arXiv preprint arXiv:2508.20373.

[42] Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335.

[43] Co-evolving LLM coder and unit tester via reinforcement learning. arXiv preprint arXiv:2506.03136.

[44] SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data. arXiv preprint arXiv:2505.20347.

[45] Spell: Self-play reinforcement learning for evolving long-context language models. arXiv preprint arXiv:2509.23863.

[46] Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models. arXiv preprint arXiv:2508.00410.

[47] Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2506.08745.

[48] Search self-play: Pushing the frontier of agent capability without supervision. arXiv preprint arXiv:2510.18821.

[49] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[50] Process reward model with Q-value rankings. arXiv preprint arXiv:2410.11287.

[51] Rewarding progress: Scaling automated process verifiers for LLM reasoning. arXiv preprint arXiv:2410.08146.

[52] The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301.

[53] R-PRM: Reasoning-driven process reward modeling. arXiv preprint arXiv:2503.21295.

[54] Entropy-regularized process reward model. arXiv preprint arXiv:2412.11006.

[55] Diversity-Aware Policy Optimization for Large Language Model Reasoning. arXiv preprint arXiv:2505.23433.

[56] Problem 66-11, Moving furniture through a hallway. SIAM Review, 1966.

[57] On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961.

[58] Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[59] LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

[60] American Invitational Mathematics Examination (AIME) 2025. Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Junfeng Sun, 2025.

[61] Qwen3 Technical Report. 2025.

[62] Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems.

[63] OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[64] Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[65] Competition-level code generation with AlphaCode. Science, 2022.

[66] RLVE: Scaling up reinforcement learning for language models with adaptive verifiable environments. arXiv preprint arXiv:2511.07317.

[67] ALE-Bench: A benchmark for long-horizon objective-driven algorithm engineering. arXiv preprint arXiv:2506.09050.

[68] Can Language Models Solve Olympiad Programming? 2024.

[69] Risk-sensitive RL for alleviating exploration dilemmas in large language models. arXiv preprint arXiv:2509.24261.

[70] Learning to discover at test time. arXiv preprint arXiv:2601.16175.

[71] Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.