{"total":11,"items":[{"citing_arxiv_id":"2605.22675","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Policy Distillation via Capability-Selective Subspace Projection","primary_cat":"cs.CL","submitted_at":"2026-05-21T16:18:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21792","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Residual Skill Optimization for Text-to-SQL Ensembles","primary_cat":"cs.CL","submitted_at":"2026-05-20T22:36:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14445","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale","primary_cat":"cs.LG","submitted_at":"2026-05-14T06:39:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01474","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReMedi: Reasoner for Medical Clinical Prediction","primary_cat":"cs.CL","submitted_at":"2026-05-02T14:44:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21138","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems","primary_cat":"cs.RO","submitted_at":"2026-04-22T22:58:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21046","ref_index":206,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is 
implemented.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.01990","ref_index":157,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems","primary_cat":"cs.AI","submitted_at":"2025-03-31T18:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(MDPs) and leverage tree search to explore diverse reasoning paths while using PRMs for quality assessment. In domain-specific applications, methods like rStar-Math [155] and DeepSeekMath [137] have demonstrated success in mathematical problem-solving through multi-round self-iteration and balanced exploration-exploitation strategies. For code generation, o1-Coder [157] leverages MCTS to generate code with reasoning processes, while Marco-o1 [157] extends this approach to open-ended tasks. These implementations highlight how the synergy between MCTS and PRM achieves effective reasoning path exploration while maintaining solution quality through fine-grained supervision. 
Beyond data-driven approaches, reinforcement learning and agentic self-improvement are the most attractive"},{"citing_arxiv_id":"2503.17352","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles","primary_cat":"cs.CV","submitted_at":"2025-03-21T17:52:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.17419","ref_index":189,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-02-24T18:50:52+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with the initial steps in the reasoning process. 3.2.3 Self Improvement Reasoning LLMs exemplify a progression from weak to strong supervision, while traditional CoT fine-tuning faces challenges in scaling effectively. Self improvement, using the model's exploration capabilities for self-supervision, gradually enhances LLMs performance [201] in tasks such as translation [189], mathematics [186], [190], and multi-modal perception [193]. This approach fosters exploration and application within reasoning LLMs [143], [216]-[218]. 
A summary of Self Improvement method is presented in Table 3. Training-based self improvement in LLMs can be categorized based on exploration and improvement strategies. The exploration phase focuses on data collection to"},{"citing_arxiv_id":"2412.21187","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs","primary_cat":"cs.CL","submitted_at":"2024-12-30T18:55:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.08146","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2024-10-10T17:31:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}