Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Pith reviewed 2026-05-15 03:14 UTC · model grok-4.3
The pith
FEST boosts RLVR performance using only 128 randomly selected demonstrations from SFT data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FEST attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. Three components are vital: the supervised signal, the on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting during multi-epoch training. On several benchmarks, it outperforms baselines while using orders of magnitude less SFT data, and even matches their performance when they use the full dataset.
What carries the argument
The FEST algorithm that integrates few-shot supervised guidance with on-policy RL signals under a decaying weight schedule on the demonstration data.
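To make the machinery concrete, here is a minimal sketch of an objective of the kind the pith describes: a supervised cross-entropy term on the few-shot demonstrations, an on-policy REINFORCE-style term driven by a verifiable reward, and a decaying weight on the supervised term. The HuggingFace-style causal-LM interface, the REINFORCE form of the RL term, and the linear decay are all illustrative assumptions, not the paper's exact objective or schedule.

```python
import torch
import torch.nn.functional as F

def sft_weight(step: int, total_steps: int, w0: float = 1.0) -> float:
    # Hypothetical linear decay; the paper's actual schedule may differ.
    return w0 * max(0.0, 1.0 - step / total_steps)

def fest_style_loss(model, demo_ids, rollout_ids, rewards, step, total_steps):
    """Combined objective: decaying few-shot SFT term + on-policy RLVR term."""
    # Supervised signal: next-token cross-entropy on demonstration sequences.
    demo_logits = model(demo_ids).logits  # (B, T, V), HF-style causal LM assumed
    sft_loss = F.cross_entropy(
        demo_logits[:, :-1].reshape(-1, demo_logits.size(-1)),
        demo_ids[:, 1:].reshape(-1),
    )
    # On-policy signal: REINFORCE-style loss with a verifiable scalar reward.
    # (Prompt-token masking omitted for brevity.)
    logits = model(rollout_ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, rollout_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    rl_loss = -(rewards.unsqueeze(1) * token_logp).mean()
    # Decaying weight on the few-shot SFT term limits multi-epoch overfitting.
    return sft_weight(step, total_steps) * sft_loss + rl_loss
```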
If this is right
- RLVR methods can succeed with far less supervised data than previously required.
- Random selection of demonstrations is sufficient for effective guidance.
- Decaying weights enable multiple epochs of training on small datasets without overfitting.
- Performance on math and coding tasks can match full-dataset approaches using minimal examples.
Where Pith is reading between the lines
- This technique could lower the barrier for training reasoning-capable LLMs by reducing data acquisition costs.
- It may inspire similar few-shot guidance strategies in other reinforcement learning domains beyond language models.
- Exploring adaptive selection or weighting beyond random choice could further improve results.
Load-bearing premise
Randomly chosen demonstrations from an SFT dataset will provide effective guidance when mixed with on-policy signals and subject to decaying weights.
What would settle it
An experiment showing that on a difficult new benchmark, FEST with 128 random demos performs no better than pure RLVR or requires non-random selection to match full SFT results.
read the original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FEST, a few-shot demonstration-guided RLVR algorithm for LLMs that uses only 128 randomly selected demonstrations from an SFT dataset. It claims that combining supervised signals, on-policy signals, and decaying weights on the few-shot data prevents overfitting during multi-epoch training and yields compelling results on math and coding benchmarks, outperforming baselines that require orders of magnitude more SFT data and even matching full-dataset performance.
Significance. If the results hold under proper robustness checks, the work could meaningfully advance sample-efficient RLVR by showing that carefully weighted few-shot guidance from minimal random subsets can substitute for large-scale SFT, reducing data acquisition costs while maintaining or exceeding performance. The emphasis on the interplay of supervised, on-policy, and decaying-weight components offers actionable guidance for practitioners working on verifiable-reward fine-tuning.
major comments (2)
- [Experimental Results] The central claim that 'randomly selected' 128 demonstrations suffice depends on the assumption that subset variance is negligible. The experimental section reports results for a single (or unreported number of) random draw(s) but provides no statistics over multiple independent random selections of the 128 examples, no standard deviation across seeds, and no ablation isolating subset quality. This is load-bearing: if example difficulty or quality varies within the SFT pool, the reported gains may reflect a favorable draw rather than a general property of FEST. Please add results from at least five independent random 128-subsets with mean and std-dev metrics (a minimal protocol sketch follows this list).
- [Ablation Studies] The paper states that supervised signal, on-policy signal, and decaying weights are 'vital,' yet the ablation tables do not quantify the performance drop when each component is removed individually while keeping the others fixed, nor do they isolate the effect of the decaying-weight schedule on multi-epoch overfitting. Without these controls, the necessity of all three components for the few-shot regime remains incompletely supported.
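A minimal sketch of the robustness protocol the first major comment requests, assuming a hypothetical train_and_eval callable that trains FEST on a given demonstration subset and returns a benchmark score; the five independent subsets and the mean/std reporting mirror the comment, everything else is illustrative.

```python
import random
import statistics

def subset_robustness(sft_pool, train_and_eval, k=5, n_demos=128, seed0=0):
    """Train on k independent random 128-example subsets; report mean and std-dev."""
    scores = []
    for i in range(k):
        rng = random.Random(seed0 + i)          # independent draw per run
        subset = rng.sample(sft_pool, n_demos)  # random selection, no replacement
        scores.append(train_and_eval(subset, seed=seed0 + i))
    return statistics.mean(scores), statistics.stdev(scores)
```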
minor comments (2)
- [Abstract] The abstract refers to 'several benchmarks' and 'outperforms baselines' without naming the specific tasks, baselines, or quantitative deltas; moving at least the headline numbers into the abstract would improve readability.
- [Method] Notation for the decaying weight schedule (e.g., the functional form and hyper-parameters) should be defined once in a dedicated subsection and then referenced consistently in equations and figures.
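As an example of the explicit definition the second minor comment asks for, one candidate functional form for the decaying weight schedule (purely illustrative; the paper's actual schedule is not reproduced in this review) is an exponential per-epoch decay:

```latex
% w_0: initial weight on the few-shot SFT loss
% \gamma \in (0, 1): per-epoch decay factor
% e(t): epoch index at optimizer step t, with T_epoch steps per epoch
w_t = w_0 \, \gamma^{e(t)}, \qquad e(t) = \left\lfloor t / T_{\text{epoch}} \right\rfloor
```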
Simulated Author's Rebuttal
Thank you for the constructive feedback. We agree that additional robustness checks and more granular ablations will strengthen the manuscript and plan to incorporate them in the revision.
read point-by-point responses
- Referee: [Experimental Results] The central claim that 'randomly selected' 128 demonstrations suffice depends on the assumption that subset variance is negligible. The experimental section reports results for a single (or unreported number of) random draw(s) but provides no statistics over multiple independent random selections of the 128 examples, no standard deviation across seeds, and no ablation isolating subset quality. This is load-bearing: if example difficulty or quality varies within the SFT pool, the reported gains may reflect a favorable draw rather than a general property of FEST. Please add results from at least five independent random 128-subsets with mean and std-dev metrics.
Authors: We acknowledge that reporting results from only a single random draw leaves the claim vulnerable to subset-specific effects. We will rerun the full evaluation pipeline on at least five independent random selections of 128 examples, report mean performance together with standard deviation, and include these statistics in the revised experimental section. revision: yes
- Referee: [Ablation Studies] The paper states that supervised signal, on-policy signal, and decaying weights are 'vital,' yet the ablation tables do not quantify the performance drop when each component is removed individually while keeping the others fixed, nor do they isolate the effect of the decaying-weight schedule on multi-epoch overfitting. Without these controls, the necessity of all three components for the few-shot regime remains incompletely supported.
Authors: We agree that the current ablations do not fully isolate each factor. In the revision we will add controlled experiments that (i) remove the supervised signal, on-policy signal, and decaying-weight schedule one at a time while holding the other two fixed, and (ii) compare fixed-weight versus decaying-weight schedules across multiple training epochs to quantify the overfitting-prevention effect. These results will be presented in an expanded ablation table. revision: yes
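A sketch of the leave-one-out ablation grid the authors promise, in which each run disables exactly one of the three components while holding the other two fixed; the component names and boolean-toggle interface are illustrative, not the paper's actual configuration.

```python
COMPONENTS = ("supervised_signal", "on_policy_signal", "decaying_weights")

def ablation_configs():
    """Yield the full method plus three leave-one-out variants."""
    full = {c: True for c in COMPONENTS}
    yield "full", dict(full)
    for c in COMPONENTS:
        cfg = dict(full)
        cfg[c] = False          # disable exactly one component at a time
        yield f"no_{c}", cfg

# Pairing this grid with a fixed-weight vs. decaying-weight comparison across
# epochs would isolate the overfitting-prevention effect the rebuttal mentions.
for name, cfg in ablation_configs():
    print(name, cfg)
```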
Circularity Check
No circularity; empirical results rest on external comparisons
full rationale
The paper introduces FEST as an empirical algorithm combining supervised fine-tuning signals, on-policy RL signals, and decaying weights on a small random subset of 128 SFT demonstrations. Its central claims are supported by reported benchmark comparisons showing outperformance relative to baselines that use more data. No equations, uniqueness theorems, or derivations are present that reduce by construction to fitted inputs or self-citations. The three vital components are identified from experimental findings rather than being presupposed in the method definition. The approach is validated against external benchmarks, with no load-bearing self-citation chains or ansatzes smuggled in via prior work.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of demonstrations = 128
- decaying weight schedule
axioms (2)
- domain assumption: Demonstration guidance from SFT data can usefully supplement RLVR when correct rollouts are rare.
- domain assumption: On-policy signals and supervised signals can be productively combined in the same training loop.