{"total":10,"items":[{"citing_arxiv_id":"2605.21467","ref_index":82,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:53:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19425","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T06:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17672","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-05-17T22:04:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five 
LRMs and five benchmarks while preserving accuracy and CoT quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15113","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning from Language Feedback via Variational Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-14T17:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14539","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards","primary_cat":"cs.CL","submitted_at":"2026-05-14T08:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11609","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information","primary_cat":"cs.LG","submitted_at":"2026-05-12T06:40:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anti-Self-Distillation reverses 
self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"AntiSD's gate is auto-calibrated from the first 5 training steps (run at λ = 0): we record the median teacher entropy Hwarm and set τdown = 0.93 Hwarm, with the gate re-enabling once H recovers to Hwarm. The 0.93 multiplier is shared across all model families, requiring no per-model tuning. Held-out evaluation reports avg@32 on AIME 2024 [33] / 2025 [34] / 2026 [35] and HMMT 2025 [3], and avg@4 on MinervaMath [11]. Full model list, sampling settings, gate-calibration details, and example teacher prompts are in Appendix B and C. 4.1 Main results Table 1: Main results (accuracy %). AIME24/25/26 and HMMT25: avg@32; Minerva: avg@4. Subscript on Avg = peak-mean step; Speedup = GRPO's best-Avg step / this row's first-reach step ( ×:"},{"citing_arxiv_id":"2605.08776","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reasoning Compression with Mixed-Policy Distillation","primary_cat":"cs.AI","submitted_at":"2026-05-09T08:04:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"methods can reduce token usage but often hurt reasoning quality. 
Chain-of-Draft substantially lowers token usage, but suffers from severe performance degradation, even underperforming Direct Comp., suggesting that forcing the model to produce sparse, \"draft-like\" intermediate thoughts can disrupt the continuous logical reasoning needed for problem solving [32]. (3) Fine-tuning on pre-refined traces provides only limited gains. Although LiteCoT performs slightly better than Chain-of-Draft, it still suffers from distribution mismatch caused by static off-policy supervision, since the student is not trained to compress its own reasoning trajectories. (4) On-policy distillation is helpful but insufficient without teacher-guided compression."},{"citing_arxiv_id":"2605.08704","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:38:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentPSO applies a particle-swarm-inspired update rule to evolve natural-language reasoning skills across multiple LLM agents, yielding gains over static and test-time multi-agent baselines with cross-benchmark transfer.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"i =s 0 i , and the initial global-best skill g0 is selected as the highest-scoring initial skill on the validation batch. Algorithm 1 summarizes the overall procedure. 5 Experimental Setup Datasets. We evaluate AgentPSO on five benchmarks covering mathematical and general reasoning. (1) For mathematical reasoning, we use DeepMath [16], MATH [24], AIME25 [49], and Minerva [23], which require multi-step reasoning, symbolic manipulation, and numerical problem solving. 
The resulting optimized agent skills are evaluated on the DeepMath test set and further applied to out-of-distribution mathematical benchmarks, including MATH, AIME25, and Minerva. (2) For general reasoning, we use BigBenchHard (BBH) [37], a challenging benchmark composed of 23 diverse"},{"citing_arxiv_id":"2605.08472","ref_index":64,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-08T20:46:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"2-3B-Instruct [10] as the primary base model for all experiments, including both baselines and mid-training, and report additional results with Qwen2.5-7B-Instruct in § A.5. We use Skywork-Reward-V2-Llama-3.2-3B [29] as the reward model (Rϕ) to score responses during data generation. Evaluation Details. We evaluate on six mathematical reasoning benchmarks: Math-500 [17], AIME 2024 [63], AIME 2025 [64], AMC 2023 [35], HMMT 2025 [2], and OlympiadBench [15], covering a wide range of difficulties and reasoning types. We use Math-Verify [19] to verify the correctness of the models' generated solutions automatically. Model performance is measured using the pass@k metric [23], where a problem is solved if at least one of k samples is correct. 
Following Chen et al."},{"citing_arxiv_id":"2601.21464","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation","primary_cat":"cs.CL","submitted_at":"2026-01-29T09:41:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}