Agentless: Demystifying LLM-based Software Engineering Agents
37 Pith papers cite this work.
abstract
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that, surprisingly, the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.
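The three-phase design the abstract describes is simple enough to sketch. Below is a minimal, hypothetical rendering of that pipeline in Python: `complete` stands in for a single LLM call, `run_tests` for applying a patch and running the repository's test suite, and all function names, prompt wording, and data shapes are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch of the fixed three-phase pipeline from the abstract
# (localization -> repair -> patch validation). The `complete` callback,
# prompts, and data shapes are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Patch:
    file: str
    diff: str  # unified diff proposed by the model

def localize(complete: Callable[[str], str], issue: str, files: list[str]) -> list[str]:
    """Phase 1: a single completion ranks suspicious files; the model never
    plans its own next action or calls tools."""
    reply = complete(
        f"Issue:\n{issue}\n\nRepository files:\n" + "\n".join(files)
        + "\n\nList the files most likely to need edits, one per line:"
    )
    return [line.strip() for line in reply.splitlines() if line.strip() in files]

def repair(complete: Callable[[str], str], issue: str, file: str, source: str) -> Patch:
    """Phase 2: ask for a candidate fix to one localized file."""
    diff = complete(
        f"Issue:\n{issue}\n\n--- {file} ---\n{source}\n\n"
        "Return a unified diff that fixes the issue:"
    )
    return Patch(file=file, diff=diff)

def agentless(
    complete: Callable[[str], str],
    issue: str,
    repo: dict[str, str],                # path -> file contents
    run_tests: Callable[[Patch], bool],  # phase 3: apply patch, run test suite
) -> Optional[Patch]:
    """Fixed localize -> repair -> validate loop: return the first candidate
    patch that passes validation, or None if none does."""
    for file in localize(complete, issue, list(repo)):
        patch = repair(complete, issue, file, repo[file])
        if run_tests(patch):
            return patch
    return None
```

The point of the sketch is structural: each phase is a fixed completion step, so there is no agent loop in which the model chooses tools or decides its own future actions.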
citing papers explorer
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
CrackMeBench: Binary Reverse Engineering for Agents
CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
-
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of the 9 evaluated LMs fully solves any task, and the best passes 95% of tests on only 3% of tasks while favoring monolithic code.
-
Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey
Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations like lengthy outputs, incorrect code, and hallucinations.
-
Social Bias in LLM-Generated Code: Benchmark and Mitigation
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
-
Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
-
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
-
Neurosymbolic Repo-level Code Localization
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
-
Evaluating LLMs Code Reasoning Under Real-World Context
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
-
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams; it includes a baseline, ExpRAG, and a proposed method, ReMem, that integrates reasoning, actions, and memory updates for continual improvement.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse
A neuro-symbolic agent system for requirements reuse achieves 100% coverage and 0.2% constraint violations by construction through symbolic enforcement of an OOMRAM lattice.
-
TypeScript Repository Indexing for Code Agent Retrieval
abcoder-ts-parser builds reliable function-level code indexes for large TypeScript repositories significantly faster by using the compiler's native AST and semantic resolution instead of per-symbol language server calls.
-
AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
-
GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair
GALA uses hierarchical graph alignment between UI screenshots and code structures to achieve state-of-the-art bug localization in multimodal automated program repair on SWE-bench.
-
On the Role of Fault Localization Context for LLM-Based Program Repair
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
-
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
-
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
-
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches a 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent history, and git worktree isolation while self-validating outputs.
-
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
-
Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure
Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.
-
Spec Kit Agents: Context-Grounded Agentic Workflows
A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
-
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
LLM-Based Automated Diagnosis Of Integration Test Failures At Google
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
-
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.