pith. machine review for the scientific record.

arxiv: 2605.08441 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

Haoyu Hu, Nori Jacoby, Xuandong Zhao, Xuhai "Orson" Xu

Pith reviewed 2026-05-12 01:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · token budget · verifiable rewards · rollout allocation · early stopping · math reasoning · training efficiency

The pith

Jointly deciding which prompts receive rollouts and when to stop each rollout improves reasoning accuracy and cuts training time under a fixed token budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reinforcement learning with verifiable rewards wastes tokens when it either spreads rollouts evenly or runs every rollout to completion. Instead, a lightweight estimate of how much a prompt will improve the model can set the number of rollouts per prompt, while an early-stop signal based on answer markers can truncate uninformative generations and reweight the remaining signal. On a 1.7-billion-parameter model trained on math problems, this dual control reaches higher accuracy than full-budget baselines even when given only half the tokens and finishes training more than twice as fast. The same pattern holds on coding tasks and other backbones.

Core claim

DUET improves both final model performance and wall-clock training speed by allocating a shared token budget across two decisions at once: how many rollouts to generate for each prompt and how many tokens to produce within each rollout.

What carries the argument

A dual-controlled token allocation layer that uses a pre-rollout surrogate to set rollout counts per prompt and a marker-gated abort rule with importance reweighting to truncate rollouts early.
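
To make the allocation half concrete, here is a minimal sketch of a cost-weighted Neyman-style rule, assuming the surrogate emits a per-prompt informativeness score and an expected per-rollout token cost; the function name, clipping bounds, and rounding are illustrative, not the paper's exact procedure.

```python
import numpy as np

def allocate_rollouts(scores, costs, total_rollouts, n_min=0, n_max=16):
    """Cost-weighted Neyman-style rollout allocation (illustrative sketch).

    scores: surrogate informativeness estimate s_q for each prompt
    costs:  expected per-rollout token cost c_q for each prompt
    Sets n_q proportional to s_q / sqrt(c_q), so informative-but-cheap
    prompts receive more rollouts, then rounds and clips to [n_min, n_max].
    """
    weights = np.asarray(scores, dtype=float) / np.sqrt(np.asarray(costs, dtype=float))
    raw = total_rollouts * weights / weights.sum()
    return np.clip(np.round(raw), n_min, n_max).astype(int)
```

Under such a rule (with n_min=0), prompts scored as uninformative can receive zero rollouts, which is where the per-prompt half of the budget savings would come from.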

If this is right

  • The same token budget can be spent on fewer but higher-value generations and still produce a stronger policy update.
  • Training wall time drops because many rollouts terminate before reaching maximum length.
  • The performance gap over uniform-budget methods widens rather than shrinks when the overall budget is reduced.
  • The method remains effective across different model sizes and across math, coding, and some scientific question-answering domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Token efficiency can be treated as an explicit optimization objective rather than an after-the-fact compression step.
  • Similar dual-control logic might apply to other expensive sampling loops such as test-time search or synthetic data generation.
  • The widening advantage at tighter budgets suggests that uniform full-length rollouts increasingly include low-value tokens as compute is limited.

Load-bearing premise

The surrogate ranks prompt usefulness accurately enough that the chosen rollout counts improve the learning signal more than uniform allocation would, and the early-abort rule does not distort the gradient estimates.

What would settle it

A controlled experiment in which the surrogate is replaced by random prompt ranking and the resulting accuracy or speedup disappears or reverses.
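
Reusing the allocate_rollouts sketch above, the ablation amounts to swapping the surrogate's scores for noise; every number below is hypothetical, chosen only to make the snippet run.

```python
import numpy as np

# Hypothetical ablation: random prompt ranking replaces the surrogate.
rng = np.random.default_rng(0)
num_prompts = 512
costs = np.full(num_prompts, 800.0)        # assumed mean rollout cost in tokens
random_scores = rng.random(num_prompts)    # noise in place of informativeness
n_q = allocate_rollouts(random_scores, costs, total_rollouts=2048)
# If accuracy and speedup collapse toward uniform allocation under n_q,
# the surrogate's ranking is what carries the effect.
```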

Figures

Figures reproduced from arXiv:2605.08441 by Haoyu Hu, Nori Jacoby, Xuandong Zhao, and Xuhai "Orson" Xu.

Figure 1. DUET dominates the budget–accuracy frontier on Qwen3-1.7B-base trained on MATH.

Figure 2. One DUET training step. Allocate: each prompt receives an informativeness score ŝ_q from previous training contributions, and the cost-weighted Neyman rule sets the per-prompt rollout count n_q. Generate: a marker-gated rule terminates a rollout shortly after an answer marker fires and aborts a marker-less rollout past a quantile threshold. Update: kept rollouts feed an importance-corrected, gradient-masked…

Figure 3. Wall-clock efficiency on Qwen3-1.7B / MATH. (a) Per-step training time across training; all cells share the same engine. (b) Speedup against full-budget GRPO across rollout budgets; DUET reaches 3.4× at quarter budget. ARRoL omitted from (b) because its κ knob is not a hard rollout-count cap. DUET shifts the entire speed–quality Pareto outward, not just one operating point.

Figure 4. DUET’s emergent properties on Qwen3-1.7B / MATH.

Figure 5. DUET’s allocation internals on Qwen3-1.7B / MATH.

Figure 6. Theory–empirics alignment on Qwen3-1.7B / MATH.

Figure 7. Per-benchmark accuracy delta versus the same-budget GRPO reference on Qwen3-1.7B.

Figure 8. Surrogate-variant comparison on Qwen3-1.7B / MATH at …

Figure 9. Per-step generated tokens on Qwen3-1.7B / MATH for each method at three rollout budgets.
read the original abstract

Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, with rollout generation dominating the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts to allocate rollouts to, and (ii) deciding how long each rollout should be. Prior work has generally controlled only one of these dimensions at a time. We show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. We instantiate this view as DUal-controlled tokEn allocaTion (DUET), a computationally efficient layer over GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set how many rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when to stop them. On Qwen3-1.7B trained on MATH, DUET outperforms full-budget GRPO and the other three budget-aware baseline methods. DUET's advantage further generalizes to other benchmarks across math and coding, and is on par with the best baseline on the scientific Q&A domain, while also achieving a 1.62× wall-clock speedup. More notably, using only 50% of the token budget, DUET still outperforms all baseline methods at their full budget, achieving an even higher 2.51× speedup over full-budget GRPO. We verify the high performance of DUET on other backbone LLMs, including Qwen3-4B and Llama-3.2-3B-Instruct. Notably, the gap between DUET and the strongest baseline widens as the budget tightens, contrary to the usual pattern in which efficient methods trade off quality as compute decreases. More broadly, these results suggest that budget-aware control strategies such as DUET are valuable not only for accelerating training, but also for improving the quality of the learning signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DUET, a dual-controlled token allocation layer over GRPO for RL with verifiable rewards. It uses a lightweight pre-rollout surrogate to decide rollout counts per prompt and a marker-gated early abort rule with importance reweighting to control rollout lengths. On Qwen3-1.7B trained on MATH, DUET is reported to outperform full-budget GRPO and three other budget-aware baselines; at 50% token budget it still beats all full-budget baselines while delivering a 2.51× wall-clock speedup. The advantage generalizes to other math and coding benchmarks and is on par with the best baseline on scientific Q&A, with the performance gap widening as the budget tightens.

Significance. If the empirical results hold under rigorous controls, the work demonstrates that jointly optimizing prompt selection and rollout length under a shared token budget can simultaneously improve reasoning quality and training efficiency in RLVR, contrary to the usual efficiency-quality trade-off. The counter-intuitive widening of the gap at tighter budgets, if reproducible, would be a notable finding for scaling reasoning models.

major comments (2)
  1. [Experimental results] The experimental results (presumably §4 and associated tables) report clear outperformance on MATH and generalization but provide no details on the number of independent runs, error bars, statistical significance, or exact hyperparameter matching for the surrogate and abort rule. This absence makes it impossible to assess whether the 2.51× speedup and quality gains are robust or sensitive to implementation choices.
  2. [Method] The importance reweighting that accompanies the marker-gated abort rule (method section) is intended to correct for the induced distribution shift, yet no derivation or explicit formula is given showing that it fully accounts for the conditional probability of reaching the abort marker. If higher-order terms in the advantage estimator are omitted, the GRPO policy gradient becomes biased, which would directly undermine the central claim that the learning signal is preserved or improved.
minor comments (2)
  1. [Method] Notation for the surrogate model and the abort marker is introduced without a clear summary table relating symbols to their roles, making the dual-control description harder to follow on first reading.
  2. [Abstract] The abstract states that DUET is 'on par with the best baseline on the scientific Q&A domain' but does not name the domain or the specific baseline; this should be clarified for precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and derivations.

read point-by-point responses
  1. Referee: [Experimental results] The experimental results (presumably §4 and associated tables) report clear outperformance on MATH and generalization but provide no details on the number of independent runs, error bars, statistical significance, or exact hyperparameter matching for the surrogate and abort rule. This absence makes it impossible to assess whether the 2.51× speedup and quality gains are robust or sensitive to implementation choices.

    Authors: We agree that reproducibility details are essential for assessing robustness. Our experiments used 5 independent runs per method and budget level with distinct random seeds. Error bars denote standard deviation across runs, and we will add them to all tables and figures. Statistical significance was evaluated via paired t-tests; we will report the resulting p-values. Hyperparameters for the surrogate and abort rule were selected via a small held-out validation set and are listed in Appendix C, with identical seeds applied across all baselines for fair comparison. We will expand §4 and add a reproducibility subsection detailing these choices. revision: yes

  2. Referee: [Method] The importance reweighting that accompanies the marker-gated abort rule (method section) is intended to correct for the induced distribution shift, yet no derivation or explicit formula is given showing that it fully accounts for the conditional probability of reaching the abort marker. If higher-order terms in the advantage estimator are omitted, the GRPO policy gradient becomes biased, which would directly undermine the central claim that the learning signal is preserved or improved.

    Authors: We acknowledge the need for an explicit derivation. The importance weight for a trajectory truncated at step t is w_t = 1 / ∏_{k=1}^t (1 - p(abort | s_k)), where p(abort | s_k) is the learned probability of hitting the marker at state s_k. This ratio exactly compensates for the conditional probability of reaching the abort point, so the reweighted advantage estimator remains unbiased under the original rollout distribution. Because the weight is applied directly to the advantage before the policy gradient, no higher-order terms are omitted and the GRPO estimator stays unbiased. We will insert this derivation and the closed-form expression into the revised Method section. revision: yes
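
A minimal rendering of the rebuttal's weight, assuming per-step abort probabilities are recorded for the kept prefix; the helper name is illustrative, and the sketch omits the gradient masking the method also applies.

```python
import numpy as np

def abort_importance_weight(abort_probs):
    """Inverse survival weight for a rollout kept through t steps (sketch):
    w_t = 1 / prod_{k=1}^{t} (1 - p(abort | s_k)).

    abort_probs: p(abort | s_k) for each step k of the kept prefix.
    Upweighting rollouts that escaped truncation by the inverse of their
    survival probability is what keeps the reweighted advantage estimator
    unbiased under the original rollout distribution.
    """
    survival = np.prod(1.0 - np.asarray(abort_probs, dtype=float))
    return 1.0 / survival

# Example: three steps that each carried a 10% abort risk.
w = abort_importance_weight([0.1, 0.1, 0.1])  # 1 / 0.9**3 ≈ 1.37
```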

Circularity Check

0 steps flagged

No circularity: empirical layering over GRPO with explicit reweighting

full rationale

The paper describes DUET as a computationally efficient layer over GRPO using a pre-rollout surrogate for prompt selection and a marker-gated abort rule with importance reweighting for rollout length. No equations or derivations are presented that reduce the claimed performance gains (e.g., outperformance at 50% budget) to quantities defined by the method's own fitted parameters or self-referential inputs. The central claims rest on experimental comparisons to baselines rather than on any self-definitional or fitted-input reduction, so the method is validated against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or implementation details, so free parameters, axioms, and invented entities cannot be audited beyond the general RLVR setting.

pith-pipeline@v0.9.0 · 5678 in / 1150 out tokens · 45773 ms · 2026-05-12T01:53:20.347269+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 18 internal anchors

  1. [1]

    Best arm identification in multi-armed bandits

    Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), 2010

  2. [2]

    Adaptive stratified sampling for Monte-Carlo integration of differentiable functions

    Alexandra Carpentier and Rémi Munos. Adaptive stratified sampling for Monte-Carlo integration of differentiable functions. In Advances in Neural Information Processing Systems, 2012. arXiv:1210.5345

  3. [3]

    Chun-Hung Chen, Jianwu Lin, Enver Yücesan, and Stephen E. Chick. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10(3):251–270, 2000

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  5. [5]

    SEED-GRPO: Semantic entropy enhanced GRPO for uncertainty-aware policy optimization

    Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. SEED-GRPO: Semantic entropy enhanced GRPO for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346, 2025

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems

    Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006

  8. [8]

    How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization

    Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, and Xunliang Cai. How to allocate, how to learn? Dynamic rollout allocation and advantage modulation for policy optimization. arXiv preprint arXiv:2602.19208, 2026

  9. [9]

    The asymptotic efficiency of simulation estimators

    Peter W. Glynn and Ward Whitt. The asymptotic efficiency of simulation estimators. Operations Research, 40(3):505–520, 1992

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645:633–638, 2025. doi: 10.1038/s41586-025-09422-z. arXiv:2501.12948

  12. [12]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. arXiv:2103.03874

  13. [13]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  14. [14]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025

  15. [15]

    Doubly robust off-policy value evaluation for reinforcement learning

    Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 652–661, 2016. arXiv:1511.03722

  16. [16]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  17. [17]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023. arXiv:2309.06180

  18. [18]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 19274–19286, 2023

  19. [19]

    LIMR: Less is more for RL scaling

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. LIMR: Less is more for RL scaling. arXiv preprint arXiv:2502.11886, 2025

  20. [20]

    Breaking the curse of horizon: Infinite-horizon off-policy estimation

    Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, 2018. arXiv:1810.12429

  21. [21]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  22. [22]

    Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558–625, 1934

  23. [23]

    Adaptive rollout allocation for online reinforcement learning with verifiable rewards

    Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, and Viet Anh Nguyen. Adaptive rollout allocation for online reinforcement learning with verifiable rewards. In International Conference on Learning Representations, 2026. arXiv:2602.01601

  24. [24]

    Monte Carlo Theory, Methods and Examples

    Art B. Owen. Monte Carlo Theory, Methods and Examples. 2013. Online manuscript, Stanford University

  25. [25]

    Carbon Emissions and Large Neural Network Training

    David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021

  26. [26]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023

  27. [27]

    Green AI

    Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020

  28. [28]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  29. [29]

    HybridFlow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297. ACM, 2025. doi: 10.1145/3689031.3696075. arXiv:2409.19256

  30. [30]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In International Conference on Learning Representations, 2025. arXiv:2408.03314

  31. [31]

    Energy and policy considerations for deep learning in NLP

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3645–3650, 2019

  32. [32]

    The self-normalized estimator for counterfactual learning

    Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, 2015

  33. [33]

    Data-efficient off-policy policy evaluation for reinforcement learning

    Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 2139–2148, 2016. arXiv:1604.00923

  34. [34]

    Reinforcement learning for reasoning in large language models with one training example

    Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  35. [35]

    Light-R1: Curriculum SFT, DPO and RL for long CoT from scratch and beyond

    Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, and Xiangzheng Zhang. Light-R1: Curriculum SFT, DPO and RL for long CoT from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025

  36. [36]

    Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning

    Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, and Nick Haber. Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning. arXiv preprint arXiv:2506.05256, 2025

  37. [37]

    Prune as you generate: Online rollout pruning for faster and better RLVR

    Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, and Hanghang Tong. Prune as you generate: Online rollout pruning for faster and better RLVR. arXiv preprint arXiv:2603.24840, 2026

  38. [38]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  39. [39]

    LIMO: Less is more for reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. LIMO: Less is more for reasoning. In Conference on Language Modeling (COLM), 2025

  40. [40]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  41. [41]

    Efficient RL training for reasoning models via length-aware optimization

    Danlong Yuan, Tian Xie, Shaohan Huang, Zhuocheng Gong, Huishuai Zhang, Chong Luo, Furu Wei, and Dongyan Zhao. Shorten after you’re right: Lazy length penalties for reasoning RL. arXiv preprint arXiv:2505.12284, 2025

  42. [42]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. In Conference on Language Modeling (COLM), 2025. arXiv:2503.18892

  43. [43]

    Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts

    Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025
