VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
Agentless: Demystifying LLM-based Software Engineering Agents
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.
representative citing papers
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.
SWE-Explore is a new benchmark evaluating repository exploration by coding agents on 848 issues across 203 repositories, using line-level ground truth from successful agent trajectories and showing agentic methods outperform classical retrieval on coverage and ranking.
The authors create the LLMCVE dataset of LLM-in-the-loop vulnerabilities and demonstrate that agent-based repair methods achieve low success rates on them, with prompt injections in particular at 28.57% Pass@1.
RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.
Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations like lengthy outputs, incorrect code, and hallucinations.
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
citing papers explorer
- EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair
ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Empirical study of 3977 agent trajectories finds Python execution errors correlate with lower success rates on GitHub issues, flags challenging errors, and reports three confirmed bugs in the SWE-Bench platform.
- SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents
SWE-MeM introduces adaptive memory management for coding agents via synthesized trajectories and Memory-aware GRPO, reporting 43.4% and 60.2% resolve rates on SWE-Bench Verified for 4B and 30B models while beating baselines on performance and token use.
- Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
- I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications
A Paper-to-Interactive-System Agent and the I-WebGenBench benchmark, built on 19 papers, enable converting scientific PDFs into executable interactive web systems; the PaperVoyager framework is shown to improve quality.
- GEAR: Genetic AutoResearch for Agentic Code Evolution
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
- KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
The paper introduces KISS Sorcar, a simple open-source AI agent framework with a five-layer hierarchy and git worktree isolation to address context limits, error propagation, and reviewability in software engineering tasks.
- More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
- Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure
Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.
- Spec Kit Agents: Context-Grounded Agentic Workflows
A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
- Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Agent-generated tests mainly act as observational feedback channels and do not meaningfully improve issue resolution success in current LLM software engineering agents.
- MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
- Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.
- Automated Summarization of Software Documents: An LLM-based Multi-Agent Approach
Metagente is an LLM multi-agent system using Teacher-Student collaboration that outperforms baselines on real-world software documentation summarization for requirements analysis and technical docs.
- What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
- LLM-Based Automated Diagnosis Of Integration Test Failures At Google
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
- An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
- Reducing Token Usage of State-in-Context Agents using Minification
Code minification reduces average input token usage by 42% in state-in-context agents with a 12 percentage point drop in resolution rate on SWE-bench Verified.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
- Toward Training Superintelligent Software Agents through Self-Play SWE-RL