{"total":15,"items":[{"citing_arxiv_id":"2606.28434","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents","primary_cat":"cs.SE","submitted_at":"2026-06-26T04:55:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWE-MeM introduces adaptive memory management for coding agents via synthesized trajectories and Memory-aware GRPO, reporting 43.4% and 60.2% resolve rates on SWE-Bench Verified for 4B and 30B models while beating baselines on performance and token use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14445","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale","primary_cat":"cs.LG","submitted_at":"2026-05-14T06:39:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12925","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation","primary_cat":"cs.SE","submitted_at":"2026-05-13T03:00:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Consecutive stage pairs are classified as pivots (forward progress, e.g. E→I), backtracks (regression, e.g. 
V→E), deepenings (same phase continues), or confirmations (transition to O). The coherence score combines a forward-progress ratio with a blind-retry penalty: Φcoh(τ) = (|pivots| + |confirms|) / (|pivots| + |confirms| + |backtracks| + ϵ) × (1 − r/|T|) (2), where r is the blind-retry count and |T| is the total transition count. A clean E→E→I→V trajectory scores 1.0; a trajectory with three regression cycles and a blind-retry cluster scores about 0.51. A worked example of the coherence calculation is provided in Appendix B.5. Temporal profile divergence (Φtemp). The trajectory is divided into three equal segments."},{"citing_arxiv_id":"2605.12913","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Revisiting DAgger in the Era of LLM-Agents","primary_cat":"cs.LG","submitted_at":"2026-05-13T02:40:28+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11051","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On Problems of Implicit Context Compression for Software Engineering Agents","primary_cat":"cs.SE","submitted_at":"2026-05-11T14:47:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"In-Context Autoencoder succeeds on single-shot common-knowledge and code tasks but fails on multi-step agentic coding tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07769","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Coding Agents Don't Know When to 
Act","primary_cat":"cs.SE","submitted_at":"2026-05-08T14:10:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Coding agents exhibit action bias by proposing undesirable changes on already-fixed issues 35-65% of the time, and explicit reproduction instructions only partially mitigate this while creating new abstention errors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04320","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reproduction Test Generation for Java SWE Issues","primary_cat":"cs.SE","submitted_at":"2026-05-05T21:49:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03546","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProgramBench: Can Language Models Rebuild Programs From Scratch?","primary_cat":"cs.SE","submitted_at":"2026-05-05T09:17:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27859","ref_index":4,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Agentic Reinforcement Learning In Large Language 
Models","primary_cat":"cs.AI","submitted_at":"2026-04-30T13:43:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Fangming Cui, Ruixiao Zhu, Cheng Fang, Sunan Li, and Jiahong Li between passive language models and truly autonomous, self-improving systems capable of meta-reasoning. Unleashed by these capabilities, a new frontier of complex application scenarios emerges, moving far beyond the scope of simple chatbot interactions [130]. In the realm of software engineering [4, 26, 82, 108, 118] , these agents evolve from merely generating simple code snippets [40] to autonomously managing entire software repositories [8, 13, 16, 24]; they can now perform complex debugging cycles, write unit tests, code [1, 7], and even optimize algorithms based on runtime performance metrics, effectively acting as autonomous software developers."},{"citing_arxiv_id":"2604.16335","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents","primary_cat":"cs.LG","submitted_at":"2026-03-13T02:23:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15763","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GLM-5: from Vibe Coding to Agentic 
Engineering","primary_cat":"cs.LG","submitted_at":"2026-02-17T17:50:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5 remains. This is because errors are compounded across the chain: a suboptimal edit in one task can silently break tests in subsequent tasks. Narrowing this gap will require advances in long-context consistency and long-horizon self-correction, both active areas of our ongoing research. 6.2.4 Evaluation on evolving SWE tasks We evaluate on SWE-rebench [4] because SWE-bench Verified is a static, public, human-validated test set and released for more than 2 years. In contrast, SWE-rebench is built on an automated pipeline that continuously mines fresh, real GitHub issue-fixing tasks, enabling decontaminated, time-robust evaluation that better measures generalization to new software engineering problems"},{"citing_arxiv_id":"2512.18552","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Training Superintelligent Software Agents through Self-Play SWE-RL","primary_cat":"cs.SE","submitted_at":"2025-12-21T00:49:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.18470","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution 
Scenarios","primary_cat":"cs.SE","submitted_at":"2025-12-20T19:08:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08827","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Reinforcement Learning for Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2025-09-10T17:59:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.02544","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2025-09-02T17:44:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"UNKNOWN","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"such as Go [60], to Atari benchmarks [9], to large-scale strategy games like StarCraft II [37], and open-ended environments such as Minecraft [17]. 
However, a key limitation of these efforts is their specificity: agents were typically optimized for a single game with tailored policies and parameters, hindering generalization across different environments [8, 38, 67]. The emergence of LLMs and VLMs has shifted attention toward more generalist agents [53]. Recent work explores their application to complex game scenarios, such as Pokémon [4, 14]. To cope with the long-horizon and multimodal nature of games, many approaches adopt workflow-style designs, equipping models with explicit modules for memory [68, 72] and planning [59, 81, 86], or fine-tuning VLMs on specific titles for"}],"limit":50,"offset":0}