The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.
arXiv: 2511.13646 [cs.SE]. URL: https://arxiv.org/abs/2511.13646
20 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 2
citation-polarity summary
polarities: background 2
representative citing papers
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
DUALVIEW is a dual-modal framework using Module Coupling, Function Call, Class Hierarchy, and Program Dependence graphs to enable persistent structural reasoning for agentic issue resolution, reporting gains on SWE-bench Pro and Verified.
PatchFusion uses deterministic atomic evidence fusion on candidate patches to outperform ranking, test-filtering, and LLM-judge selectors on SWE-bench and Defects4J pools.
LLVM-Bench supplies 423 validated LLVM issues and LLVM-Gym automates evaluation, showing LLMs are limited but an ensemble reaches 21.99% resolution.
Symbolon learns diverse code transformations via search on small programs, distills them into agent skills, and applies them to improve KLEE symbolic execution, yielding 3.69x coverage gains and 21 new Linux kernel bugs.
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
A benchmark of 495 issues and 1787 constraints from six repositories shows that LLM agents resolve fewer than half of issues while also satisfying design constraints, even when their patches pass tests.
LLM agents complete over 80% of tasks on a new 849-task Rust verification benchmark and over 90% on unfinished human proofs.
RQGM enables co-evolution of agents and evaluators across epochs with non-stationary utilities, reporting gains in coding pass rates, paper acceptance, and proof grading over prior self-improving agents.
SkillHone introduces a harness that maintains persistent decision histories to support continual evolution of language-model agent skills, reporting 15.8-point gains on GAIA over a commercial deep-research agent.
This survey organizes existing work on LLM-based agents around code as the central harness, in three layers (interfaces, mechanisms, and multi-agent scaling), and covers applications across domains and open challenges.
Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.
MicroSkill Architecture partitions knowledge into atomic skill capsules selected via constrained optimization, cutting token use by over 90% and improving code generation metrics in one enterprise case study.
citing papers explorer
- The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators
RQGM enables co-evolution of agents and evaluators across epochs with non-stationary utilities, reporting gains in coding pass rates, paper acceptance, and proof grading over prior self-improving agents.
- SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History
SkillHone introduces a harness that maintains persistent decision histories to support continual evolution of language-model agent skills, reporting 15.8-point gains on GAIA over a commercial deep-research agent.