{"total":18,"items":[{"citing_arxiv_id":"2606.01168","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-31T11:20:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token use than standard CoT on GSM8K and MATH500.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22567","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:47:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14054","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bad Seeing or Bad Thinking? 
Rewarding Perception for Multimodal Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T19:23:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08905","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The difficulty of generated problem instances is categorized according to the number of integers available (|numbers|), the typical size of the optimal solution (|I|), and the range of integer values: •Easy: - Total numbers ∈[5,10], solution size ∈[4,8], values in [1,5]. - Small input with low values, ensuring frequent feasible solutions. •Medium: - Total numbers ∈[8,12], solution size ∈[4,8], values in [1,10]. - Moderate instance size and range, requiring more careful subset selection. •Hard: - Total numbers ∈[12,15], solution size ∈[8,12], values in [1,15]. - Larger solution sizes and wider value ranges increase combinatorial difficulty. 
•Benchmark: - Total numbers ∈[15,20], solution size ∈[10,15], values in [1,15]."},{"citing_arxiv_id":"2605.08441","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:03:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(b) Accuracy at full budget across baselines; DUET at half budget exceeds full-budget GRPO. (c) Wall-clock speedup at full budget, normalized to GRPO; DUET at half budget runs 2.51× faster than full-budget GRPO. Preprint. arXiv:2605.08441v1 [cs.LG] 8 May 2026 1 Introduction Reasoning-centric large language models (LLMs) such as DeepSeek-R1 [11], Light-R1 [35], and Qwen3 [38] have advanced state-of-the-art performance on mathematical and code-reasoning benchmarks [12, 6, 4], and the post-training engine behind these results is reinforcement learning with verifiable rewards (RLVR) [28]. 
The recent rise of RLVR has been closely associated with GRPO [28], whose recipe is straightforward: at each training step, draw several candidate solutions per prompt,"},{"citing_arxiv_id":"2605.06165","ref_index":159,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:51:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17614","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Characterizing Model-Native Skills","primary_cat":"cs.AI","submitted_at":"2026-04-19T20:58:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01970","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-02-02T11:24:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPS trains a small model on optimization history to predict 
prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03847","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training","primary_cat":"cs.LG","submitted_at":"2025-12-03T14:48:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DVPO learns token-level value distributions and uses asymmetric risk regularization to contract lower tails while expanding upper tails, outperforming PPO and GRPO under noisy supervision in dialogue, math, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18814","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2025-10-21T17:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SePT alternates self-generation of responses at controlled temperatures with training on the latest model outputs, yielding gains over a strong no-training baseline on six math reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04265","ref_index":99,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Don't Pass@k: A Bayesian Framework for Large Language Model 
Evaluation","primary_cat":"cs.AI","submitted_at":"2025-10-05T16:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.03988","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Signal is in the Steps: Local Scoring for Reasoning Data Selection","primary_cat":"cs.LG","submitted_at":"2025-10-05T01:15:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LALP scores local reasoning steps rather than full trajectories to improve selection of training data from diverse teacher models for distilling long-form reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25758","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training","primary_cat":"cs.AI","submitted_at":"2025-09-30T04:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of 
specific","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.08636","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling","primary_cat":"cs.CL","submitted_at":"2025-08-12T05:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.22312","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skywork Open Reasoner 1 Technical Report","primary_cat":"cs.LG","submitted_at":"2025-05-28T12:56:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Skywork-OR1 uses RL on distilled CoT models to lift math and coding benchmark accuracy by 13-15 points while open-sourcing everything.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.20571","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning for Reasoning in Large Language Models with One Training Example","primary_cat":"cs.LG","submitted_at":"2025-04-29T09:24:30+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.14945","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Reason under Off-Policy Guidance","primary_cat":"cs.LG","submitted_at":"2025-04-21T08:09:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.16419","ref_index":191,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-03-20T17:59:38+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Token-Budget [58]; Chain of Draft [204]; Token Complexity [83]; Concise Chain-of-Thought (CCoT) [147]; MARP [13]; ThoughtMani [118]; NoThinking [129]; Brevity [142]; PREMISE [221]; ConciseHint [169]; Routing by Question Attributes e.g. Claude 3.7 Sonnet [4]; SoT [6]; Self-REF [25]; Confident [24]; RouteLLM [136]; THOUGHTTERMINATOR [143]; ThinkSwitcher [98]; SwitchCoT [234]; SynapseRoute [236]; Efficient Data and Models Less Training Data e.g. LIMO [216]; s1 [132]; S2R [128]; Light-R1 [191]; Pruning & Quantization & Distillation e.g. 
Struggle [93]; Strong Verifiers [162]; TinyR1-32B-Preview [166]; Mixed Distillation [23]; Counterfactual Distillation [48]; Feedback-Driven Distillation [248]; SKIntern [101]; AdaptiveThinking [15]; PRR [245]; CompressionReasoning [233]; TwT [203]; Benchmark & Insights Evaluation & Benchmarks e.g. 1B vs. 405B [113]; Sys2Bench [140]; Danger [29]; Inference-time Computation [109]; Impact [76]; S1-Bench [237]; CompressionReasoning [233]; QuantRM [112]; SmallRM [250];"}],"limit":50,"offset":0}