{"total":11,"items":[{"citing_arxiv_id":"2605.26606","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-26T06:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09806","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-10T23:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"L1 [14] trains reasoning models to follow user-specified length constraints, O1-Pruner [15] uses length-harmonizing fine-tuning to reduce redundant long-thought reasoning, and DRPO [18] decouples the learning signals for correct and incorrect rollouts to avoid penalizing valid long reasoning. LASER [19] formulates efficient reasoning through adaptive length-based reward shaping, while GFPO [21] encourages concise reasoning by filtering sampled rollouts according to length and reward-per-token efficiency. 
Other methods estimate or impose problem-dependent budgets: ShorterBetter [16] uses the shortest correct rollout as a Sample Optimal Length, SmartThinker [17] calibrates reasoning length through a distributional estimate, SelfBudgeter [20] predicts"},{"citing_arxiv_id":"2605.08873","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-09T10:51:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06523","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:30:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"has been proposed as an effective strategy to enhance reasoning in domains such as mathematics and programming. OpenAI's o1 was the first to demonstrate that RL can incentivize large-scale reasoning, inspiring subsequent models such as DeepSeek-R1 [11], and Qwen3 [37]. Building on these advances, later approaches such as Dr.GRPO [19], CISPO [4], GFPO [33], GMPO [42], etc. have further broadened the landscape of RL-based reasoning. 
Interpreting Reinforcement learning. A recent study [7] identified the phenomenon of entropy collapse in reinforcement learning, where rapid early convergence causes the model to become overly confident, prematurely degrading its exploratory capacity. A related study [31] observed in"},{"citing_arxiv_id":"2605.06165","ref_index":284,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:51:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09852","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MEMENTO: Teaching LLMs to Manage Their Own Context","primary_cat":"cs.AI","submitted_at":"2026-04-10T19:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02913","ref_index":102,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement 
Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02795","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks","primary_cat":"cs.CL","submitted_at":"2026-04-03T07:02:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[12] Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. 2024. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143. [13] Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. 2025. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790. [14] Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, et al. 2025. Coderl+: Improving code generation via reinforcement with execution semantics alignment. arXiv preprint arXiv:2510.18471. 
[15] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester"},{"citing_arxiv_id":"2601.05242","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization","primary_cat":"cs.CL","submitted_at":"2026-01-08T18:59:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26522","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Entropy After </Think> for reasoning model early exiting","primary_cat":"cs.LG","submitted_at":"2025-09-30T16:59:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.16419","ref_index":157,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-03-20T17:59:38+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought 
outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5 [171]; O1-Pruner [127]; L1 [2]; Training [5]; Demystifying [217]; DAST [154]; MRT [146]; Self-adaptive [212]; HAWKEYE [152]; ThinkPrune [66]; LongShort [134]; ConciseRL [40]; Bingo [110]; Concise Reasoning [47]; Elastic Reasoning [207]; S-GRPO [33]; TLDR [239]; SelfBudgeter [95]; Short-RL [225]; BRPO [144]; LASER [116]; ACPO [21]; LIMOPro [198]; L-GRPO [160]; GRPO-λ [32]; AutoThink [175]; AdaptThink [230]; DeGRPO [45]; HGPO [74]; DTO [3]; REO-RL [52]; ALP [196]; PLP [107]; LC-R1 [22]; AdapThink [179]; AALC [90]; DuP-PO [36]; SCPO [63]; FCS [65]; CurriculumGRPO [57]; GFPO [157]; SABER [242]; VSRM [228]; DR.SAF [12]; ASRR [238]; AdaCoT [122]; SFT with Variable-Length CoT e.g. Distilling 2-1 [219]; C3oT [78]; TokenSkip [194]; CoT-Valve [130]; Self-Training [133]; Learn to Skip [115]; Token-Budget [58]; Verbosity [72]; Stepwise [31]; Z1 [223]; Prune-on-Logic [243]; LS-Mixture SFT [218]; DRP [75]; AutoL2S [125]; Assembly of Experts [79]; Ada-R1 [126]; ConCISE [145]; VeriThinker [19]; R1-Compress [187]; CTS [226]; A∗-Thought [205]; TLDR [96]; OThink-R1 [235]; PNS [220]; ReCUT [77]; StepEntropy [94]; ASAP [229];"}],"limit":50,"offset":0}