{"total":12,"items":[{"citing_arxiv_id":"2605.23019","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PACE: Two-Timescale Self-Evolution for Small Language Model Agents","primary_cat":"cs.LG","submitted_at":"2026-05-21T20:42:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PACE coordinates low-risk prompt evolution with validated higher-risk control-logic updates to improve frozen SLM agents on benchmarks without model retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22505","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Direct Evaluation of Harness Optimizers via Priority Ranking","primary_cat":"cs.AI","submitted_at":"2026-05-21T13:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20086","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Do Evolutionary Coding Agents Evolve?","primary_cat":"cs.NE","submitted_at":"2026-05-19T16:41:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18930","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences","primary_cat":"cs.CR","submitted_at":"2026-05-18T14:08:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11882","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"interaction, Reflexion stores verbal reflections from past failures, and self-refinement methods use model-generated feedback or revisions at inference time [ 44, 38, 25, 47]. Recent studies also analyze self-evolving-agent risks and failure-based agent learning, including misevolution, experience-driven safety degradation, negative-trajectory fine-tuning, and hard-negative failure gen- eration [36, 49, 40, 17]. Different from these inference-time approaches, FATE performs on-policy policy refinement by turning failed trajectories into verifier-filtered repair supervision. Our work is also related to preference optimization and reinforcement learning from feedback, including RLHF, DPO, and GRPO [28, 4, 30, 35, 37]. However, scalar safety rewards can induce broad refusal or other"},{"citing_arxiv_id":"2605.10663","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:43:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"design, with considerable research devoted to experience representation [33, 39, 27, 16], formulation, and management mechanisms [ 6, 24, 34]. While these prompt-based methods have successfully demonstrated that injecting past experiences can significantly enhance downstream decision-making, their effectiveness is ultimately bounded by the underlying model's ability to extract and leverage these experiences [ 18]-a process that heavily relies on the model possessing robust in-context learning [4] and abstract reasoning capabilities. Several recent studies [30, 29] have explored reinforcement learning as a way to enhance the model's ability to utilize experience. However, these methods do not optimize self-evolution as a unified process. They improve only the utilization phase, while relying on stronger external models or"},{"citing_arxiv_id":"2605.09315","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation","primary_cat":"cs.AI","submitted_at":"2026-05-10T04:20:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08715","ref_index":47,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Your agent may misevolve: Emergent risks in self-evolving llm agents.arXiv preprint arXiv:2509.26354, 2025. 12 [46] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279-1297, 2025. [47] Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, et al. From commands to prompts: Llm-based semantic file system for aios. InInternational Conference on Learning Representations, volume 2025, pages 33108-33131, 2025. [48] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao."},{"citing_arxiv_id":"2605.05583","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Belief Memory: Agent Memory Under Partial Observability","primary_cat":"cs.AI","submitted_at":"2026-05-07T02:03:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24919","ref_index":98,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic AI for Remote Sensing: Technical Challenges and Research Directions","primary_cat":"cs.CV","submitted_at":"2026-04-27T18:59:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agent may misevolve: Emergent risks in self-evolving llm agents. arXiv preprint arXiv:2509.26354 (2025). [97] Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begüm Demir, Nicu Sebe, and Paolo Rota. 2025. EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM. arXiv preprint arXiv:2506.01667 (2025). [98] Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, and Paolo Rota. 2026. TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation. arXiv preprint arXiv:2603.19039 (2026). [99] Simranjit Singh, Michael Fore, and Dimitrios Stamoulis. 2024. Evaluating tool-augmented agents in remote sensing platforms. arXiv preprint arXiv:2405."},{"citing_arxiv_id":"2604.15774","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-04-17T07:29:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04759","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw","primary_cat":"cs.CR","submitted_at":"2026-04-06T15:27:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}