pith. machine review for the scientific record.

arxiv: 2605.09188 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

Can Jin, Dimitris N. Metaxas, Lei Li, Runxue Bao, Shiyu Zhao, Yang Zhou, Yanting Yang, Yaochen Xie, Zhepeng Wang, Zihan Dong

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · large language models · difficulty adaptation · policy optimization · reasoning improvement · training efficiency · inference efficiency

The pith

Reinforcement learning for large language models becomes more efficient when difficulty estimation co-evolves with the policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning improves reasoning in large language models but wastes samples on weak signals from rollouts. Existing difficulty-aware selection methods suffer because estimates turn inaccurate once the policy drifts, selection alone adds little to final performance, and inference costs stay unchanged. The paper claims these problems are fixed by updating the difficulty estimator jointly with the policy, using self-normalized importance sampling to track policy change, a symmetric Beta distribution to keep sampled difficulties diverse, and separate training rules with adjusted compute for each difficulty tier. If correct, the result is faster training, stronger final models, and quicker inference, because the model learns to give short answers to easy prompts and more accurate answers to hard ones.

Core claim

DARE co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments show this unified approach consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones.

What carries the argument

Co-evolution of difficulty estimation and policy via self-normalized importance sampling, augmented by symmetric Beta sampling for difficulty diversity and adaptive compute allocation across difficulty tiers.
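
A minimal sketch of how that load-bearing step could look, reading the abstract together with the Algorithm 1 fragment quoted in the reference graph: rollouts buffered under older policies are reweighted with self-normalized importance ratios under the current policy to re-estimate each prompt's failure rate. The function name, buffer layout, and clipping value below are illustrative assumptions, not the paper's released code.

```python
import math

def snis_difficulty(buffered_rollouts, logprob_current, clip=10.0):
    """Re-estimate a prompt's difficulty (failure rate) from stale rollouts.

    buffered_rollouts: list of dicts with keys
        'response'    - a past rollout o_k for this prompt
        'reward'      - binary outcome r_k in {0, 1}
        'logprob_beh' - log-probability of o_k under the policy that generated it
    logprob_current: callable giving the log-probability of a response under the
        *current* policy (assumed interface).
    """
    weights, rewards = [], []
    for rollout in buffered_rollouts:
        # Importance ratio between current and behavior policy, clipped for stability.
        log_w = logprob_current(rollout["response"]) - rollout["logprob_beh"]
        weights.append(math.exp(min(log_w, math.log(clip))))
        rewards.append(rollout["reward"])

    total = sum(weights)
    if total == 0.0:
        return None  # no usable signal; caller falls back to another estimator

    # Self-normalized importance-sampling estimate of the success probability.
    success = sum(w * r for w, r in zip(weights, rewards)) / total
    return 1.0 - success  # difficulty = estimated failure rate under the current policy
```

Self-normalization makes the estimate depend only on the relative weights, which is what lets rollouts from earlier policies keep informing the difficulty estimate as the policy drifts.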

If this is right

  • Training efficiency rises because compute focuses on prompts that still provide strong learning signals (see the allocation sketch after this list).
  • Final model performance improves specifically on hard reasoning tasks.
  • Inference efficiency increases because models produce shorter outputs on easy inputs without loss of quality.
  • The gains appear across multiple model sizes and reasoning domains.
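
On the first point, a sketch of how a difficulty-tiered rollout budget could be assigned per prompt. The tier thresholds and per-tier counts echo the symbols d_easy, d_hard and G, G_easy, G_hard that appear in the paper's Algorithm 1 fragment, but the values and the three-way rule here are illustrative assumptions.

```python
def allocate_rollouts(difficulty, d_easy=0.25, d_hard=0.75,
                      g_easy=4, g_default=8, g_hard=16):
    """Map an estimated difficulty (failure rate in [0, 1]) to a rollout budget.

    Thresholds and counts are placeholder hyperparameters, not values from the paper.
    """
    if difficulty <= d_easy:
        # Easy tier: few rollouts; training here can also reward concise answers.
        return "easy", g_easy
    if difficulty >= d_hard:
        # Hard tier: a larger budget so at least some rollouts succeed and carry signal.
        return "hard", g_hard
    # Middle tier: the standard group size.
    return "medium", g_default
```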

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Difficulty estimation may need to be updated continuously rather than treated as a static preprocessing step in long RL runs.
  • The same joint-evolution pattern could apply to other adaptive sampling problems where agent skill changes over time.
  • Lower inference costs from learned conciseness may allow wider deployment of reasoning models under tight resource limits.

Load-bearing premise

That updating the difficulty estimator together with the policy will keep the estimates accurate and beneficial even as the policy drifts and improves over training.

What would settle it

An experiment in which, after a substantial policy update, the estimated difficulties no longer predict actual rollout success rates on the same prompts and overall performance drops to the level of non-adaptive baselines.
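
One way that settling experiment could be instrumented, as a hedged sketch: after a policy update, re-roll a small probe set and check whether the stored difficulty estimates still track fresh empirical failure rates. The probe-set interface, rollout count, and correlation threshold are all assumptions for illustration.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

def estimator_still_tracks_policy(probe_prompts, predicted_difficulty,
                                  run_rollouts, n_rollouts=16, min_corr=0.5):
    """Return True if stored difficulty estimates still predict fresh failure rates.

    probe_prompts:        held-out prompts used only for this diagnostic
    predicted_difficulty: dict mapping prompt -> difficulty estimate in [0, 1]
    run_rollouts:         callable(prompt, n) -> list of binary rewards sampled
                          from the *current* policy (assumed interface)
    """
    predicted, observed = [], []
    for prompt in probe_prompts:
        rewards = run_rollouts(prompt, n_rollouts)
        predicted.append(predicted_difficulty[prompt])
        observed.append(1.0 - sum(rewards) / len(rewards))  # empirical failure rate

    # If this relationship collapses after an update, the estimator has drifted
    # and overall performance should be compared against non-adaptive baselines.
    return correlation(predicted, observed) >= min_corr
```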

Figures

Figures reproduced from arXiv: 2605.09188 by Can Jin, Dimitris N. Metaxas, Lei Li, Runxue Bao, Shiyu Zhao, Yang Zhou, Yanting Yang, Yaochen Xie, Zhepeng Wang, Zihan Dong.

Figure 1
Figure 1: Difficulty estimation results for Qwen2.5-Math-1.5B during and after GRPO training [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2: Training-step and wall-clock efficiency on MATH-500 with Qwen2.5-Math-1.5B trained [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3: Overview of DARE. At each epoch, DARE updates a prompt-wise replay buffer, estimates current prompt difficulty with SNIS, and uses the estimate for dynamic prompt selection and difficulty-adaptive policy optimization. Here o_k is a generated response, r_k ∈ {0, 1} is the binary outcome reward, and π_θ_beh,k is the behavior policy that generated o_k; for prompts without buffered rollouts, we use an embedding-ba… view at source ↗
Figure 4
Figure 4: Training performance on MATH-500 (left) and GSM8K (right) with Qwen2.5-Math-7B [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5: Case study of stratified generation behavior. For the easy Level-1 problem, the model [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6: Reasoning entropy versus normalized difficulty across multiple models and datasets, [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 7
Figure 7: Training curves on MATH-500 (left), GSM8K (center), and MATH-500 vs. training time [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
Figure 8
Figure 8: Training curves on code-generation benchmarks for Qwen2.5-Coder-1.5B [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9: Left: Accuracy under constrained rollout budgets on AIMO-AMC and MATH-500 Level 5. Our difficulty estimation method achieves consistently higher accuracy, with the largest gains at small budgets. Right: Venn diagram of problems solved at K=64. Our difficulty estimation covers most problems solved by any baseline while uniquely solving additional examples. [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗
read the original abstract

Reinforcement learning improves the reasoning ability of large language models but remains costly and sample-inefficient, as many rollouts provide weak learning signals. Difficulty-aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final-performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose **Dare**, a unified framework that co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. Extensive experiments across multiple models and domains demonstrate that **Dare** consistently outperforms existing methods in training efficiency, final effectiveness, and inference efficiency, producing more concise responses on easy tasks while improving correctness on hard ones. Code is available at https://github.com/EtaYang10th/DARE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DARE, a unified framework for difficulty-adaptive RL in LLMs. It co-evolves difficulty estimation with the policy via self-normalized importance sampling (SNIS), maintains diverse coverage using a symmetric Beta sampling distribution, and applies tailored training strategies across difficulty tiers with adaptive compute allocation. The central claim is that this addresses three limitations of prior difficulty-aware selection methods (inaccurate estimates under drift, limited final gains from selection alone, and unchanged inference efficiency) and yields consistent improvements in training efficiency, final effectiveness, and inference efficiency across models and domains, with more concise outputs on easy tasks and higher correctness on hard ones. Code is released.

Significance. If the results hold, the work could advance sample-efficient RL for LLMs by making difficulty adaptation dynamic rather than static. The explicit code release supports reproducibility and community verification, which strengthens the contribution beyond typical empirical claims in the area.

major comments (2)
  1. [Method] Method section (co-evolution via SNIS): the framework relies on self-normalized importance sampling to update difficulty estimates as the policy changes, yet no variance-reduction steps (clipping, control variates, or effective-sample-size monitoring) are described. Because the paper itself identifies policy drift as the reason prior methods fail, it is necessary to show that the batch-normalized weights w_i = π_new(a_i|s_i)/π_old(a_i|s_i) retain sufficient effective sample size during typical LLM-RL updates; otherwise the tailored per-tier strategies rest on noisy labels.
  2. [Experiments] Experiments section: the abstract asserts consistent outperformance in training efficiency, final performance, and inference efficiency, but the provided text supplies no quantitative metrics, ablation tables, error bars, or statistical tests. Without these, the load-bearing claim that co-evolution plus Beta sampling plus adaptive compute produces gains beyond simple data filtering cannot be evaluated.
minor comments (2)
  1. [Abstract] Abstract: adding one or two concrete performance deltas (e.g., relative token reduction on easy tasks or accuracy lift on hard tasks) would make the claims more immediately verifiable.
  2. Notation: the symmetric Beta distribution is introduced for diversity; a short equation or pseudocode line showing how its parameters are set relative to the current difficulty estimates would improve clarity.
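
On the second minor comment, a minimal sketch of one way a symmetric Beta over difficulty could drive prompt selection: draw target difficulties from Beta(κ, κ), which is symmetric about 0.5, and match each draw to the prompt whose current estimate is closest. Only the symmetric Beta itself is named by the paper; the concentration value and the nearest-match rule below are assumptions.

```python
import random

def select_prompts(difficulty_by_prompt, batch_size, kappa=2.0, rng=random):
    """Pick a batch whose difficulty profile follows a symmetric Beta(kappa, kappa).

    difficulty_by_prompt: dict mapping prompt -> current difficulty estimate in [0, 1]
    kappa: concentration; kappa > 1 favors mid-range difficulty, kappa = 1 is uniform.
    """
    pool = dict(difficulty_by_prompt)
    batch = []
    for _ in range(min(batch_size, len(pool))):
        target = rng.betavariate(kappa, kappa)  # symmetric draw around 0.5
        # Greedy nearest match to the sampled target difficulty (illustrative rule).
        prompt = min(pool, key=lambda p: abs(pool[p] - target))
        batch.append(prompt)
        del pool[prompt]  # sample without replacement within the batch
    return batch
```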

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the presentation and address the concerns raised.

read point-by-point responses
  1. Referee: [Method] Method section (co-evolution via SNIS): the framework relies on self-normalized importance sampling to update difficulty estimates as the policy changes, yet no variance-reduction steps (clipping, control variates, or effective-sample-size monitoring) are described. Because the paper itself identifies policy drift as the reason prior methods fail, it is necessary to show that the batch-normalized weights w_i = π_new(a_i|s_i)/π_old(a_i|s_i) retain sufficient effective sample size during typical LLM-RL updates; otherwise the tailored per-tier strategies rest on noisy labels.

    Authors: We agree that an explicit analysis of weight stability is warranted given our discussion of policy drift in prior methods. In the revised manuscript we will add a new subsection (or appendix) reporting the effective sample size of the SNIS weights throughout training on the main benchmarks. We will also include a short discussion explaining why the symmetric Beta sampling distribution, combined with the gradual co-evolution schedule, keeps ESS sufficiently high without requiring additional clipping or control variates in our setting. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract asserts consistent outperformance in training efficiency, final performance, and inference efficiency, but the provided text supplies no quantitative metrics, ablation tables, error bars, or statistical tests. Without these, the load-bearing claim that co-evolution plus Beta sampling plus adaptive compute produces gains beyond simple data filtering cannot be evaluated.

    Authors: The full manuscript already contains quantitative results, ablation studies, and multi-model comparisons in Sections 4 and 5. To make these claims more readily evaluable, we will revise the experiments section to include error bars on all plots, report statistical significance tests for key comparisons, and expand the ablation tables to directly contrast the full DARE framework against simple difficulty-based filtering baselines. These additions will be placed in the main text rather than the appendix. revision: yes
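
The effective-sample-size reporting promised in the first response could be as simple as the following sketch, which computes the Kish effective sample size of a prompt's importance weights and flags estimates whose weight mass has collapsed onto a few rollouts. The reliability threshold is an illustrative choice, not a value from the paper.

```python
def effective_sample_size(weights):
    """Kish effective sample size of importance weights: (sum w)^2 / (sum w^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s1 * s1) / s2 if s2 > 0 else 0.0

def snis_estimate_is_reliable(weights, min_fraction=0.3):
    """Flag a difficulty estimate as unreliable when the ESS falls below a fixed
    fraction of the buffered rollouts for that prompt."""
    if not weights:
        return False
    return effective_sample_size(weights) >= min_fraction * len(weights)
```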

Circularity Check

0 steps flagged

No significant circularity in DARE derivation

full rationale

The paper identifies three limitations of prior difficulty-aware RL methods and proposes DARE as a unified framework combining co-evolution of difficulty estimates with the policy via self-normalized importance sampling (a standard technique), symmetric Beta sampling for coverage, and tier-specific training with adaptive compute. These components are presented as independent mechanisms whose value is demonstrated through experiments on multiple models and domains rather than any reduction to fitted parameters, self-definitions, or self-citation chains. No equations or claims in the provided text equate predictions or results to inputs by construction, and the approach remains externally falsifiable via empirical outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract does not list explicit free parameters or invented entities; the method appears to rely on standard RL assumptions plus the new sampling and co-evolution mechanisms whose hyperparameters are not detailed here.

axioms (1)
  • standard math Standard assumptions of reinforcement learning (Markov decision process, reward signals from correctness)
    Implicit in any RL-for-LLM setup; invoked by the problem statement.

pith-pipeline@v0.9.0 · 5523 in / 1245 out tokens · 61395 ms · 2026-05-12T03:06:49.126117+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 17 internal anchors

  1. [1]

    AI-MO validation AIME dataset

    AI-MO. AI-MO validation AIME dataset. Hugging Face Datasets, 2024. URL https: //huggingface.co/datasets/AI-MO/aimo-validation-aime

  2. [2]

    AI-MO validation AMC dataset

    AI-MO. AI-MO validation AMC dataset. Hugging Face Datasets, 2024. URL https: //huggingface.co/datasets/AI-MO/aimo-validation-amc

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv. org/abs/2108.07732

  4. [4]

    Online difficulty filtering for reasoning oriented reinforcement learning

    Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026. URLhttps://arxiv.org/abs/2504.03380

  5. [5]

    SmolLM3: Smol, multilingual, long-context reasoner

    Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, and Thomas Wolf. SmolLM3: Smol, multilingual, long-context reasoner. Hugging Face Blog, 2025. URL https://huggingface.co/blog/smollm3

  6. [6]

    The role of deductive and inductive reasoning in large language models

    Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, and Lei Li. The role of deductive and inductive reasoning in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16780–16790, 2025

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. URL https: //arxiv.org/abs/2107.03374

  8. [8]

    Self-evolving curriculum for LLM reasoning

    Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for LLM reasoning. arXiv preprint arXiv:2505.14970, 2025. URL https://arxiv.org/abs/2505.14970

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning

    Huilin Deng, Ding Zou, Rui Ma, Hongchen Luo, Yang Cao, and Yu Kang. Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning. InFindings of the Association for Computational Linguistics: EMNLP 2025, 2025. URL https://arxiv.org/abs/2503.07065

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [12]

    OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Compu...

  13. [13]

    DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456, 2025

  14. [14]

    Measuring Coding Challenge Competence With APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. InAdvances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021. URLhttps://arxiv.org/abs/2105.09938

  15. [15]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2021. URLhttps://arxiv.org/abs/2103.03874

  16. [16]

    Open-Reasoner-Zero: An Open Source Approach to Scaling up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An Open Source Approach to Scaling up Reinforcement Learning on the Base Model. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  17. [17]

    VADE: Variance-aware dynamic sampling via online sample-level difficulty estimation for multimodal RL

    Zengjie Hu, Jiantao Qiu, Tianyi Bai, Haojin Yang, Binhang Yuan, Qi Jing, Conghui He, and Wentao Zhang. VADE: Variance-aware dynamic sampling via online sample-level difficulty estimation for multimodal RL. arXiv preprint arXiv:2511.18902, 2025. URL https://arxiv.org/abs/2511.18902

  18. [18]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024. URL https://arxiv.org/abs/2409.12186

  19. [19]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida I. Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. InInternational Conference on Learning Representations (ICLR), 2025. URLhttps://arxiv.org/abs/2403.07974

  20. [20]

    Your reward function for RL is your best PRM for search: Unifying RL and search-based TTS

    Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Zihan Dong, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, et al. Your reward function for RL is your best PRM for search: Unifying RL and search-based TTS. arXiv preprint arXiv:2508.14313, 2025

  21. [21]

    Reasoning over precedents alongside statutes: Case-augmented deliberative alignment for LLM safety

    Can Jin, Rui Wu, Tong Che, Qixin Zhang, Hongwu Peng, Jiahui Zhao, Zhenting Wang, Wenqi Wei, Ligong Han, Zhao Zhang, et al. Reasoning over precedents alongside statutes: Case-augmented deliberative alignment for LLM safety. arXiv preprint arXiv:2601.08000, 2026

  22. [22]

    Revisiting generalization across difficulty levels: It’s not so easy

    Yeganeh Kordi, Nihal V Nayak, Max Zuo, Ilana Nguyen, and Stephen Bach. Revisiting generalization across difficulty levels: It’s not so easy. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7014–7042, 2026

  23. [23]

    Probing the difficulty perception mechanism of large language models

    Sunbowen Lee, Qingyu Yin, Chak Tou Leong, Jialiang Zhang, Yicheng Gong, Shiwen Ni, Min Yang, and Xiaoyu Shen. Probing the difficulty perception mechanism of large language models. arXiv preprint arXiv:2510.05969, 2025. URLhttps://arxiv.org/abs/2510.05969

  24. [24]

    Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning

    Xuanyu Lei, Chenliang Li, Yuning Wu, Kaiming Liu, Weizhou Shen, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. Writing-RL: Advancing long-form writing via adaptive curriculum reinforcement learning.arXiv preprint arXiv:2506.05760, 2025. URL https: //arxiv.org/abs/2506.05760

  25. [25]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  26. [27]

    URL https://arxiv.org/abs/2312.14852

  27. [28]

    LIMR: Less is more for RL scaling

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. LIMR: Less is more for RL scaling. arXiv preprint arXiv:2502.11886, 2025. URL https://arxiv.org/abs/2502.11886

  28. [29]

    Knapsack RL: Unlocking exploration of LLMs via optimizing budget allocation

    Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, and Zhi-Quan Luo. Knapsack RL: Unlocking exploration of LLMs via optimizing budget allocation. arXiv preprint arXiv:2509.25849, 2025. URLhttps://arxiv.org/abs/2509.25849

  29. [30]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

  30. [31]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/ 2305.01210

  31. [32]

    Leveraging explanation to improve generalization of meta reinforcement learning

    Shicheng Liu and Minghui Zhu. Leveraging explanation to improve generalization of meta reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026

  32. [33]

    Explainable reinforcement learning from human feedback to improve alignment

    Shicheng Liu, Siyuan Xu, Wenjie Qiu, Hangfan Zhang, and Minghui Zhu. Explainable reinforcement learning from human feedback to improve alignment. Advances in Neural Information Processing Systems, 38:138261–138291, 2026

  33. [34]

    DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

    Xiang Liu, Xuming Hu, Xiaowen Chu, and Eunsol Choi. DiffAdapt: Difficulty-adaptive reasoning for token-efficient LLM inference. InProceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/ 2510.19669

  34. [35]

    DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, et al. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL. Notion Blog, 3(5), 2025

  35. [36]

    AIME 2024 and 2025 problems

    Mathematical Association of America. AIME 2024 and 2025 problems. American Invitational Mathematics Examination, 2025. URL https://maa.org/maa-invitational-competitions/

  36. [37]

    EDCO: Dynamic curriculum orchestration for domain-specific large language model fine-tuning

    Jing-Cheng Pang, Liu Sun, Chang Zhou, Xian Tang, Haichuan Ma, Kun Jiang, Jianlong Wang, Kai Zhang, Sijie Wu, Haoran Cai, Chenwei Wu, Xubin Li, and Xin Chen. EDCO: Dynamic curriculum orchestration for domain-specific large language model fine-tuning. arXiv preprint arXiv:2601.03725, 2026. URL https://arxiv.org/abs/2601.03725

  37. [38]

    Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning

    Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning. In Proceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs...

  38. [39]

    Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, and Xiangyang Ji. Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models? In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2026. URLhttps://arxiv.org/abs/2507.04632

  39. [40]

    David Rein, Betty Li, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023. URLhttps://arxiv.org/abs/2311.12022

  40. [41]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  41. [42]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  42. [43]

    BOTS: A unified framework for Bayesian online task selection in LLM reinforcement finetuning

    Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, and Jingren Zhou. BOTS: A unified framework for Bayesian online task selection in LLM reinforcement finetuning. InProceedings of the Fourteenth International Conference on Learning Representa- tions (ICLR), 2026. URLhttps://arxiv.org/abs/2510.26374

  43. [44]

    Intrinsic entropy of context length scaling in llms

    Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jenq-Neng Hwang, and Lei Li. Intrinsic entropy of context length scaling in llms. InThe Fourteenth International Conference on Learning Representations, 2026

  44. [45]

    Efficient reinforcement finetuning via adaptive curriculum learning

    Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025. URL https://arxiv.org/abs/2504.05520

  45. [46]

    Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay

    Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/2506.05316

  46. [47]

    Toward trustworthy difficulty assessments: Large language models as judges in programming and synthetic tasks

    HM Tabib and Jaber Ahmed Deedar. Toward trustworthy difficulty assessments: Large language models as judges in programming and synthetic tasks. arXiv preprint arXiv:2511.18597, 2025

  47. [48]

    Towards high data efficiency in reinforcement learning with verifiable reward

    Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, and Jun Zhou. Towards high data efficiency in reinforcement learning with verifiable reward. In Proceedings of the Fourteenth International Conference on Learning Representations (ICLR),

  48. [49]

    URL https://arxiv.org/abs/2509.01321

  49. [50]

    Light-R1: Curriculum SFT, DPO and RL for long COT from scratch and beyond

    Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-R1: Curriculum SFT, DPO and RL for long COT from scratch and beyond. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL): Industry T...

  50. [51]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  51. [52]

    Zhe Yang, Yichang Zhang, Tianyu Liu, Jian Yang, Junyang Lin, Chang Zhou, and Zhifang Sui. Can large language models always solve easy problems if they can solve harder ones? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1531–1555, 2024

  52. [53]

    LIMO: Less is more for reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. In Conference on Language Modeling (COLM), 2025. URL https://arxiv.org/abs/2502.03387

  53. [54]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  54. [55]

    Mitigating forgetting between supervised and reinforcement learning yields stronger reasoners

    Xiangchi Yuan, Xiang Chen, Tong Yu, Dachuan Shi, Can Jin, Wenke Lee, and Saayan Mitra. Mitigating forgetting between supervised and reinforcement learning yields stronger reasoners. arXiv preprint arXiv:2510.04454, 2025

  55. [56]

    Behavior knowledge merge in reinforced agentic models

    Xiangchi Yuan, Dachuan Shi, Chunhui Zhang, Zheyuan Liu, Shenglong Yao, Soroush Vosoughi, and Wenke Lee. Behavior knowledge merge in reinforced agentic models. arXiv preprint arXiv:2601.13572, 2026

  56. [57]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  57. [58]

    CurES: From gradient analysis to efficient curriculum learning for reasoning LLMs

    Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, and Jun Wang. CurES: From gradient analysis to efficient curriculum learning for reasoning LLMs. InProceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://arxiv.org/abs/2510.01037

  58. [59]

    SPEED-RL: Faster training of reasoning models via online curriculum learning

    Ruiqi Zhang, Daman Arora, Song Mei, and Andrea Zanette. SPEED-RL: Faster training of reasoning models via online curriculum learning. In ICML 2025 Workshop on AI for Math (AI4Math@ICML25), 2025. URL https://arxiv.org/abs/2506.09016

  59. [60]

    UFO-RL: Uncertainty-focused optimization for efficient reinforcement learning data selection

    Yang Zhao, Kai Xiong, Xiao Ding, Li Du, Yangou Ouyang, Zhouhao Sun, Jiannan Guan, Wenbin Zhang, Bin Liu, Dong Hu, Bing Qin, and Ting Liu. UFO-RL: Uncertainty-focused optimization for efficient reinforcement learning data selection. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. URLhttps://arxiv.org/abs/2505.12457

  60. [61]

    Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts

    Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/2506.02177

  61. [62]

    M^3-Bench: Multi-modal, multi-hop, multi-threaded tool-using MLLM agent benchmark

    Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, and Dimitris N Metaxas. M^3-Bench: Multi-modal, multi-hop, multi-threaded tool-using MLLM agent benchmark. arXiv preprint arXiv:2511.17729, 2025

  62. [63]

    LED: LLM enhanced open-vocabulary object detection without human curated data generation

    Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, and Dimitris N Metaxas. LED: LLM enhanced open-vocabulary object detection without human curated data generation. arXiv preprint arXiv:2503.13794, 2025

  63. [64]

    The LLM already knows: Estimating LLM-perceived question difficulty via hidden representations

    Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, and Jing Shao. The LLM already knows: Estimating LLM-perceived question difficulty via hidden representations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1160–1176, 2025. doi: 10.18653/v1/2025.emnlp-main.61. URL https://arxiv.org/abs/2...

  64. [65]

    [Top-scoring fragment 1]

  65. [66]

    [Top-scoring fragment 2]

  66. [67]

    [Top-scoring fragment 3] Algorithm 1 DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation. Require: Training set D, reference set D_ref, initial policy π_θ, replay buffer B ← ∅, Beta concentration κ, thresholds d_easy, d_hard, rollout counts G, G_easy, G_hard, replay mix ratio σ, clipping bound c. 1: /* Cold-start: initializ...

  67. [68]

    We can also verify with code: (3**1)**6 == 3**6 returns True. Short rollout (219 tokens, ✓): Using the power rule (a^m)^n = a^{mn}: (3^k)^6 = 3^{6k}

    The solution is correct. We can also verify with code: (3**1)**6 == 3**6 returns True. Short rollout (219 tokens, ✓): Using the power rule (a^m)^n = a^{mn}: (3^k)^6 = 3^{6k}. So 3^{6k} = 3^6 ⇒ 6k = 6 ⇒ k = 1. Substituting back: (3^1)^6 = 3^6 ⇒ 3^6 = 3^6. Level 5 (Hard): Short wrong vs. long correct rollout. Problem: There are 360 people in a school. 15 take calculus, physics, ...