EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
hub
arXiv: 2511.13646 [cs.SE].URL:https://arxiv.org/abs/2511.13646
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
Symbolon learns diverse code transformations via search on small programs, distills them into agent skills, and applies them to improve KLEE symbolic execution, yielding 3.69x coverage gains and 21 new Linux kernel bugs.
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.
RQGM enables co-evolution of agents and evaluators across epochs with non-stationary utilities, reporting gains in coding pass rates, paper acceptance, and proof grading over prior self-improving agents.
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.
citing papers explorer
-
EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
-
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
-
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
Dynamic analysis enhances issue resolution
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
-
Symbolon: Symbolic Execution by Learning Code Transformation
Symbolon learns diverse code transformations via search on small programs, distills them into agent skills, and applies them to improve KLEE symbolic execution, yielding 3.69x coverage gains and 21 new Linux kernel bugs.
-
DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
-
AgentSPEX: An Agent SPecification and EXecution Language
AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
-
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
-
VeruSAGE: A Study of Agent-Based Verification for Rust Systems
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.
-
The Red Queen G\"odel Machine: Co-Evolving Agents and Their Evaluators
RQGM enables co-evolution of agents and evaluators across epochs with non-stationary utilities, reporting gains in coding pass rates, paper acceptance, and proof grading over prior self-improving agents.
-
Code as Agent Harness
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
-
Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.
- Toward Training Superintelligent Software Agents through Self-Play SWE-RL