{"total":42,"items":[{"citing_arxiv_id":"2606.16517","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Post-Training Shapes Biological Reasoning Models","primary_cat":"cs.LG","submitted_at":"2026-06-15T10:19:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Post-training stages reshape generalization in biological reasoning models distinctly: CPT aligns with biological language, SFT boosts ID performance but causes OOD to peak early and decline, while RL on strong SFT checkpoints can recover OOD generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01249","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trust Region On-Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-31T14:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01075","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Generalization Gap in Self-Evolving Language Model Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-31T07:43:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28388","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-27T12:25:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25198","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hide to Guide: Learning via Semantic Masking","primary_cat":"cs.LG","submitted_at":"2026-05-24T17:59:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SMEPO applies fine-grained semantic masking to expert guidance in RLVR, turning hard problems into fill-in-the-blank tasks while preserving structure, yielding up to 3.2 point accuracy gains and 4.2x faster training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22642","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-21T15:47:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20256","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T12:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"FBOS-RL uses environment feedback for better exploration plus bi-objective training to speed up and raise the performance ceiling of RL compared to GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16874","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reasoning Can Be Restored by Correcting a Few Decision Tokens","primary_cat":"cs.AI","submitted_at":"2026-05-16T08:33:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11538","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting","primary_cat":"cs.CL","submitted_at":"2026-05-12T05:05:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08905","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"number of subsets |S|, and the relative subset size (controlled by the parameter subset_size_factor): •Easy: -|U| ∈[10,20] , |S| ∈[5,10] , subset size factor= 0.4 - Small universe and relatively large sub- sets, making coverage straightforward. •Medium: -|U| ∈[20,25] , |S| ∈[10,15] , subset size factor= 0.4 - Moderate universe size and subset count, requiring careful selection. •Hard: -|U| ∈[25,30] , |S| ∈[15,25] , subset size factor= 0.4 - Larger universe with more subsets, in- creasing combinatorial complexity. •Benchmark: -|U| ∈[30,40] , |S| ∈[20,30] , subset size factor= 0.4 - The most challenging setting, with the largest universes and dense subset collec- tions. A.5.2 Subset Sum The Subset Sum Problem asks whether a subset of integers sums up to a given target value T ."},{"citing_arxiv_id":"2605.08378","ref_index":153,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reinforcement Learning for Scalable and Trustworthy Intelligent Systems","primary_cat":"cs.LG","submitted_at":"2026-05-08T18:36:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RL offers a natural fit. It allows the model to generate full task completions and be rewarded directly based on the presence or absence of specific information types in its output. Moreover, RL is often more data-efficient than supervised fine-tuning (SFT); recent work has shown that RL can yield improvements with as little as a single training example [ 153 ]. Nevertheless, comparing SFT and RL-based approaches on CI frameworks remains an important direction. Unstructured and Retrieval-Augmented Contexts. We constructed a relatively simple training dataset with semi-structured input. Yet our method yields consistent gains on more natural, free-form chats with conversation history (PrivacyLens) and shows the same trend on an external replication with ConfAIde (single-"},{"citing_arxiv_id":"2605.08283","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:38:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Stabilizing off-policy reinforcement learning for llms via balanced policy optimization with adaptive clipping.arXiv preprint arXiv: 2510.18927, 2025. [38] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable realworld web interaction with grounded language agents.In Proceedings of the Advances in Neural Information Processing Systems, 2022. [39] YipingWang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example.arXiv preprint arXiv: 2504.20571, 2025. [40] Robin Young. Why is rlhf alignment shallow?"},{"citing_arxiv_id":"2605.07244","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:01:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06650","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:55:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06755","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gradient Extrapolation-Based Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:20:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We first consider the clean case where the coordinate-wise geometric model is exact. Corollary 2(Diagonal-quadratic GD-surrogate sanity check).Consider the global diagonal quadratic loss L(θ) = 1 2 θ⊤H0θ, H 0 = diag(h1, . . . , hd), h i >0, ηh i ≤1. Assume all nonzero-gradient coordinates are active, finite-precision stabilizers are omitted, and α= 1. Let µ:= min i hi >0, ρ:= (1−ηµ) 2 ∈[0,1). Then one clean GXPO outer step with three backward passes reaches the same point as K+ 1 plain-GD steps: θGXPO new =θ GD K+1, and consequently, afterB∈3Nbackward passes, L \u0010 θGXPO B/3 \u0011 ≤ρ (K+1)B/3 L(θ0), hence, if0< ρ <1, BGXPO =O \u0012 3 K+ 1 log 1 ε \u0013 . This is only an algebraic sanity check: in the easiest case, the extrapolated point lands exactly where multiple GD steps would land. Appendix A.6 proves Corollary 2. Real losses are not diagonal quadratics, so the next result bounds the local error of the GD surrogate. Theorem 3(Local displacement-error bound for the GD surrogate).Suppose K≥2 , L ∈C 3, supξ ∥∇3L(ξ)∥ ≤M 3, and the true GD trajectory satisfies sup 0≤n<K ∥g(θtrue n )∥ ≤G. Letρ ⋆ ≥1andρ ⋆ ≥ ∥I−ηH 0∥. Split coordinates into A={i:|g 0,i|> δ},S=A c. Consider the clean active-set surrogate that uses empirical ratios on A and the observed two-probe displacement on S. If the active empirical ratios and diagonal surrogate rates are bounded by R, then ∥θemp K −θ true K ∥ ≤E off +E ratio +E nonquad, where Eoff comes from off-diagonal Hessian coupling, Eratio from empirical-ratio error and inactive- coordinate fallback, and Enonquad from the Taylor remainder. The explicit constants are given in Theorem 8, Lemma 9, and Corollary 10 in Appendix A.5; together they prove Eoff =O K2η2∥Hoff 0 ∥∥g0∥ρK−2 ⋆ \u0001 , Eratio =O(η 2/δ) +O(η∥g 0,S ∥1) +O η2∥(H0g0)S ∥1 \u0001 , Enonquad =O K3η3M3G2ρK−1 ⋆ \u0001 . This bound gives simple checks for whether GXPO is operating in its intended local regime. The extrapolated displacement should have small error, the error may grow with K but should r"},{"citing_arxiv_id":"2605.06241","ref_index":22,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning","primary_cat":"cs.CL","submitted_at":"2026-05-07T13:25:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01823","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards","primary_cat":"cs.LG","submitted_at":"2026-05-03T11:10:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28020","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cost-Aware Learning","primary_cat":"cs.LG","submitted_at":"2026-04-30T15:39:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cost-Aware SGD samples by gradient-norm-to-cost ratio and is instantiated as Cost-Aware GRPO for length-dependent policy gradients, reducing tokens used in LLM RL while matching baseline accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28005","ref_index":24,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-30T15:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19937","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-21T19:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17928","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment","primary_cat":"cs.LG","submitted_at":"2026-04-20T08:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15676","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation","primary_cat":"cs.DB","submitted_at":"2026-04-17T03:54:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"35, and PyTorch1.13.0backend. Datasets.We evaluate the effectiveness of EvoRAG on three real- world datasets, as summarized in Table 1. RGB [9] is constructed from news articles to evaluate reasoning capabilities of RAG tasks. MultiHop (MTH) [72] focuses on multi-hop queries that require integrating information from multiple news documents. HotpotQA (HPQ) [85] is a multi-hop QA dataset derived from Wikipedia, where answering each query requires combining evidence from two para- graphs. We use all300English-language reasoning queries from the RGB dataset and all816reasoning queries from the MTH dataset. For HPQ, we follow prior work [27] and randomly select600queries for evaluation due to the high cost of processing the full dataset."},{"citing_arxiv_id":"2604.08209","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","primary_cat":"cs.CV","submitted_at":"2026-04-09T13:09:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A primary bottleneck impeding the extension of these RL-driven successes to omni-modal reasoning is the significant difficulty of acquiring massive, high- quality annotated data and providing effective supervisory signals. In text-only domains such as mathematics or coding, it is relatively straightforward to gener- ate large-scale problem instances and provide verifiable, deterministic feedback for RL optimization [39]. Conversely, for omni-modal understanding [4,9,33,37], collecting an equivalent volume of omni-modal data that intrinsically necessi- tates complex collaborative cross-modal reasoning is prohibitively expensive and labor-intensive [11,38,52]. Driven by these challenges, we explore a fundamen- tal question in this work: Can we identify a suitable proxy task that effectively"},{"citing_arxiv_id":"2605.02913","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02341","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM Reasoning with Process Rewards for Outcome-Guided Steps","primary_cat":"cs.LG","submitted_at":"2026-02-08T06:38:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.03452","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing","primary_cat":"cs.LG","submitted_at":"2026-02-03T12:17:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01970","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-02-02T11:24:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21464","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation","primary_cat":"cs.CL","submitted_at":"2026-01-29T09:41:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.20829","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning","primary_cat":"cs.LG","submitted_at":"2026-01-28T18:29:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16175","ref_index":79,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning to Discover at Test Time","primary_cat":"cs.LG","submitted_at":"2026-01-22T18:24:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13399","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Differentiable Evolutionary Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2025-12-15T14:50:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.00066","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sharpness-Guided Group Relative Policy Optimization via Probability Shaping","primary_cat":"cs.LG","submitted_at":"2025-10-29T08:07:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10649","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning","primary_cat":"cs.AI","submitted_at":"2025-10-12T15:06:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25454","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search","primary_cat":"cs.AI","submitted_at":"2025-09-29T20:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.00222","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization","primary_cat":"cs.AI","submitted_at":"2025-07-31T23:55:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL-PLUS is a hybrid RL approach for LLMs that combines internal exploitation with external data via importance sampling and exploration advantages to prevent capability boundary collapse and achieve gains on math and OOD reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12549","ref_index":123,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Serial Scaling Hypothesis","primary_cat":"cs.LG","submitted_at":"2025-07-16T18:01:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.02833","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generalizing Verifiable Instruction Following","primary_cat":"cs.CL","submitted_at":"2025-07-03T17:44:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.21734","ref_index":99,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hierarchical Reasoning Model","primary_cat":"cs.AI","submitted_at":"2025-06-26T19:39:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12119","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource","primary_cat":"cs.CL","submitted_at":"2025-06-13T17:59:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01939","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2025-06-02T17:54:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15134","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2025-05-21T05:39:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01990","ref_index":165,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems","primary_cat":"cs.AI","submitted_at":"2025-03-31T18:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that adapt their own core functionalities through sophisticate collaboration. This system-level approach has driven a surge of innovation focused on multi-agent coordination, credit assignment, and stable training for real-world applications. A prominent direction is the shift from single agents to multi-agent systems to tackle complexity. Frameworks like MARFT [165] now provide standardized infrastructure for multi-agent reinforcement fine-tuning, enabling novel cognitive architectures where agents collaborate as thinkers, critics, and solvers to learn \"meta-think\" [166]. Addressing the core RL challenge of credit assignment in long-horizon tasks, researchers have developed more granular reward mechanisms. For instance, SPA-RL reinforces agents"}],"limit":50,"offset":0}