pith. machine review for the scientific record.

arXiv: 2407.01489 · v2 · submitted 2024-07-01 · 💻 cs.SE · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Lingming Zhang, Soren Dunn, Yinlin Deng

Pith reviewed 2026-05-12 05:06 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.LG
keywords LLM agents · software engineering · program repair · agentless · SWE-bench · bug fixing · automation

The pith

A simple three-phase process without any agent outperforms every open-source LLM-based software agent it is compared against on the SWE-bench Lite benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether elaborate autonomous agents are truly needed for LLM-driven software tasks such as bug fixing. It introduces Agentless as a straightforward alternative that follows a fixed sequence of localizing the problem, generating a fix, and validating the patch, with no dynamic planning or complex tool use by the model. On the SWE-bench Lite benchmark this workflow produces 96 correct fixes, a 32 percent success rate, at an average cost of 70 cents per issue, beating every open-source agent system it is compared against. The authors also manually review the benchmark, remove issues whose descriptions contain the exact ground-truth patch or are insufficient or misleading, and release the stricter Lite-S subset to support more reliable comparisons. The work positions simple, interpretable methods as a competitive baseline rather than an afterthought in autonomous software engineering.
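The headline figures can be checked with a few lines of arithmetic. This is only a sanity check, not anything taken from the paper's artifacts. The benchmark size of 300 issues is an assumption brought in from outside the summary above, and the helper names are invented for this example.

```python
# Figures quoted in the summary above.
CORRECT_FIXES = 96
AVG_COST_CENTS = 70  # $0.70 per issue, kept in integer cents

# Assumption (not stated in the text above): SWE-bench Lite has 300 issues.
TOTAL_ISSUES = 300


def success_rate_percent(correct: int, total: int) -> float:
    """Share of issues resolved, as a percentage."""
    return 100 * correct / total


def total_cost_dollars(avg_cents: int, total: int) -> float:
    """Approximate cost of one full benchmark run, in dollars."""
    return avg_cents * total / 100
```

Under that assumption, `success_rate_percent(CORRECT_FIXES, TOTAL_ISSUES)` reproduces the reported 32 percent, and `total_cost_dollars(AVG_COST_CENTS, TOTAL_ISSUES)` puts one full pass over the benchmark at roughly 210 dollars.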

Core claim

Agentless shows that a fixed three-phase workflow of localization, repair, and patch validation solves more software issues correctly and at lower cost than agent-based systems that let the LLM choose actions and operate tools in an open-ended loop.

What carries the argument

The agentless three-phase process of localization, repair, and patch validation, which replaces LLM-driven planning and tool calls with a fixed sequence.
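To make the contrast with an agent loop concrete, the fixed sequence can be sketched as follows. This is a minimal illustration, not the authors' implementation. The names `Issue`, `localize`, `repair`, `validate`, and `run_pipeline` are hypothetical, and the keyword match and string replacement are toy stand-ins for the LLM calls a real system would make.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Patch = dict[str, str]  # path -> patched source text


@dataclass
class Issue:
    description: str
    files: dict[str, str]  # path -> source text (hypothetical repo snapshot)


def localize(issue: Issue) -> list[str]:
    """Phase 1: pick suspicious files.
    Stand-in for an LLM prompt: match longer words from the issue text
    against file contents."""
    words = {w.lower() for w in issue.description.split() if len(w) > 3}
    return [path for path, src in issue.files.items()
            if any(w in src.lower() for w in words)]


def repair(issue: Issue, locations: list[str]) -> list[Patch]:
    """Phase 2: propose candidate patches for the located files.
    Stand-in for LLM sampling: a single candidate that rewrites one
    hard-coded expression."""
    return [{p: issue.files[p].replace("a - b", "a + b") for p in locations}]


def validate(candidates: list[Patch],
             check: Callable[[Patch], bool]) -> Optional[Patch]:
    """Phase 3: return the first candidate that passes the check, else None."""
    for patch in candidates:
        if check(patch):
            return patch
    return None


def run_pipeline(issue: Issue, check: Callable[[Patch], bool]) -> Optional[Patch]:
    # The control flow is fixed: the model never chooses the next step.
    locations = localize(issue)
    candidates = repair(issue, locations)
    return validate(candidates, check)
```

The point of the sketch is the shape of `run_pipeline`: three calls in a fixed order, with the model confined to the inside of each phase instead of deciding which action or tool comes next.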

If this is right

  • Complex agent architectures are not required to reach state-of-the-art performance on current software repair benchmarks.
  • Improvements to the individual phases of localization and repair can deliver gains without adding agent overhead.
  • Low per-issue cost makes the method practical for large-scale use in software maintenance.
  • Cleaner benchmark subsets like Lite-S become necessary for rigorous evaluation as methods improve.
  • Simple fixed workflows provide a clearer starting point for measuring progress in LLM-based software engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current LLMs appear to benefit from rigid structure more than from open-ended autonomy when handling coding tasks.
  • Future agents could be built by wrapping the three phases with selective enhancements rather than starting from full autonomy.
  • Benchmark curation practices may need to become standard to prevent over-optimism from easy or ill-specified cases.
  • Practitioners could deploy low-cost phase-based methods immediately while agent research continues.

Load-bearing premise

The manual review and exclusion of problematic issues from SWE-bench Lite to create Lite-S does not introduce selection bias that favors the simple approach.

What would settle it

An independent run of Agentless and the competing agent systems on the original unfiltered SWE-bench Lite or on a fresh set of issues without any manual filtering.

Original abstract

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Agentless, a simple three-phase pipeline (localization, repair, patch validation) for LLM-based software engineering that avoids complex tool use, planning, or autonomous decision-making by the model. On the SWE-bench Lite benchmark it reports 32.00% success (96 correct fixes) at $0.70 average cost, outperforming existing open-source agents; the authors additionally construct SWE-bench Lite-S by manually classifying and excluding issues that have exact ground-truth patches or insufficient/misleading descriptions, claiming this enables more rigorous evaluation.

Significance. If the reported performance numbers and fairness of comparisons hold, the work is significant for demonstrating that a fixed, interpretable, low-cost pipeline can exceed the results of more elaborate agent architectures on a widely used benchmark. This could usefully reset baselines and shift research attention toward simpler methods. The public-benchmark evaluation and explicit cost reporting are strengths.

major comments (1)
  1. [Abstract and benchmark-construction section] The construction of SWE-bench Lite-S (Abstract and the section describing the manual classification) excludes issues based on single-team manual review for 'exact ground truth patch' or 'insufficient/misleading' descriptions. No inter-rater reliability, blinding protocol, or quantitative justification for the exclusion thresholds is provided; because the central claim of superior and more rigorous performance rests on this curated subset not selectively disadvantaging iterative agents, the omission is load-bearing and requires explicit documentation or sensitivity analysis.
minor comments (2)
  1. [Abstract] The abstract states results 'compared with all existing open-source software agents' without naming the specific systems or referencing the table/figure that lists them; adding this cross-reference would improve clarity.
  2. [Abstract and experimental results] Performance figures (e.g., 32.00%, 96 fixes) should be accompanied by the total number of problems in each benchmark variant for immediate context.
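The inter-rater reliability requested in the major comment is commonly reported as Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The sketch below shows the calculation only. `cohens_kappa` is a helper written for this example, and any labels passed to it here are invented placeholders, not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if the raters labelled independently.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)
```

For example, if two reviewers label four issues as `["keep", "keep", "drop", "drop"]` and `["keep", "drop", "drop", "drop"]`, they agree on three of four items, chance agreement is 0.5, and kappa comes out at 0.5.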

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our work. We address the major concern regarding the construction of SWE-bench Lite-S below.

Point-by-point responses
  1. Referee: [Abstract and benchmark-construction section] The construction of SWE-bench Lite-S (Abstract and the section describing the manual classification) excludes issues based on single-team manual review for 'exact ground truth patch' or 'insufficient/misleading' descriptions. No inter-rater reliability, blinding protocol, or quantitative justification for the exclusion thresholds is provided; because the central claim of superior and more rigorous performance rests on this curated subset not selectively disadvantaging iterative agents, the omission is load-bearing and requires explicit documentation or sensitivity analysis.

    Authors: We agree that the manuscript provides insufficient detail on the manual classification process for SWE-bench Lite-S. In the revised version we will expand the benchmark-construction section with: (1) explicit criteria and concrete examples for each exclusion category (exact ground-truth patch matches and insufficient/misleading descriptions), (2) the exact counts of issues removed under each category, and (3) a sensitivity analysis that reports performance on the original SWE-bench Lite, on Lite-S, and on intermediate subsets obtained by varying the exclusion thresholds. The classification was performed by the author team without formal inter-rater reliability statistics or blinding protocols; we will document this limitation transparently and make the full list of excluded issues and rationales publicly available for community inspection. Because every method we compare (including iterative agents) is evaluated on exactly the same curated subset, the comparison remains fair. The exclusions target objective benchmark defects rather than properties that would systematically favor non-iterative pipelines, and the sensitivity analysis will allow readers to verify that the reported advantage is robust to different curation choices.

    revision: yes
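A sensitivity analysis of the kind described in this response amounts to recomputing each system's resolve rate on progressively filtered subsets of the benchmark. The following is a minimal sketch under stated assumptions: per-issue outcomes and per-issue defect labels are available, and the function names, label names, and data layout are all hypothetical, not the authors' tooling.

```python
def resolve_rate(resolved: set[str], issues: list[str]) -> float:
    """Fraction of the given issues that a system resolved."""
    if not issues:
        raise ValueError("empty subset")
    return sum(i in resolved for i in issues) / len(issues)


def sensitivity(
    resolved_by_system: dict[str, set[str]],
    flags: dict[str, set[str]],
    exclusion_sets: dict[str, set[str]],
) -> dict[str, dict[str, float]]:
    """Resolve rate of every system on every curated subset.

    resolved_by_system: system name -> ids of issues it resolved
    flags: issue id -> defect labels assigned during manual review
           (an empty set means the issue was judged clean)
    exclusion_sets: subset name -> defect labels that exclude an issue
    """
    table: dict[str, dict[str, float]] = {}
    for subset, excluded in exclusion_sets.items():
        # Keep an issue only if it carries none of the excluded labels.
        kept = [i for i, labels in flags.items() if not (labels & excluded)]
        table[subset] = {
            name: resolve_rate(resolved, kept)
            for name, resolved in resolved_by_system.items()
        }
    return table
```

Calling `sensitivity` with an empty exclusion set, a set holding one defect label, and a set holding both gives the original, intermediate, and strict views side by side, so a reader can see whether the ranking of systems changes as the curation gets tighter.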

Circularity Check

0 steps flagged

No circularity: empirical results on public benchmark with explicit methodological choices

Full rationale

The paper reports direct empirical performance of Agentless (32.00% on SWE-bench Lite) against other agents via straightforward comparison on a fixed public benchmark, without equations, fitted parameters, or derivations. Construction of Lite-S is presented as an explicit manual filtering step for 'more rigorous evaluation,' not as a self-referential definition or prediction that reduces to its own inputs. No self-citation chains, uniqueness theorems, or ansatzes are load-bearing for the central claim. The evaluation chain is self-contained against external benchmarks and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution is primarily empirical, relying on the assumption that LLMs can effectively perform the subtasks in the three phases when prompted appropriately. No free parameters are introduced in the central claim, and no new entities are postulated.

axioms (1)
  • domain assumption: LLMs can perform code localization, repair, and validation tasks when given appropriate prompts
    The approach relies on the capability of current LLMs to handle these subtasks effectively.

pith-pipeline@v0.9.0 · 5613 in / 1536 out tokens · 77003 ms · 2026-05-12T05:06:37.210988+00:00 · methodology

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  3. AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.

  4. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  5. CrackMeBench: Binary Reverse Engineering for Agents

    cs.SE 2026-05 accept novelty 7.0

    CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

  6. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  7. Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

    cs.SE 2026-05 unverdicted novelty 7.0

    LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.

  8. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  9. Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

    cs.SE 2026-05 unverdicted novelty 7.0

    Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations...

  10. Social Bias in LLM-Generated Code: Benchmark and Mitigation

    cs.SE 2026-05 unverdicted novelty 7.0

    LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

  11. Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%

    cs.SE 2026-04 unverdicted novelty 7.0

    Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.

  12. Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

    cs.SE 2026-04 unverdicted novelty 7.0

    ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

  13. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  14. Neurosymbolic Repo-level Code Localization

    cs.SE 2026-04 unverdicted novelty 7.0

    LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

  15. Evaluating LLMs Code Reasoning Under Real-World Context

    cs.SE 2026-04 unverdicted novelty 7.0

    R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

  16. An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

    cs.SE 2026-04 unverdicted novelty 7.0

    ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

  17. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  18. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.

  19. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...

  20. SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

    cs.LG 2026-05 unverdicted novelty 6.0

    SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.

  21. Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse

    cs.SE 2026-05 unverdicted novelty 6.0

    A neuro-symbolic agent system for requirements reuse achieves 100% coverage and 0.2% constraint violations by construction through symbolic enforcement of an OOMRAM lattice.

  22. TypeScript Repository Indexing for Code Agent Retrieval

    cs.SE 2026-04 unverdicted novelty 6.0

    abcoder-ts-parser builds reliable function-level code indexes for large TypeScript repositories significantly faster by using the compiler's native AST and semantic resolution instead of per-symbol language server calls.

  23. AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

    cs.SE 2026-04 conditional novelty 6.0

    AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.

  24. GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair

    cs.SE 2026-04 unverdicted novelty 6.0

    GALA uses hierarchical graph alignment between UI screenshots and code structures to achieve state-of-the-art bug localization in multimodal automated program repair on SWE-bench.

  25. On the Role of Fault Localization Context for LLM-Based Program Repair

    cs.SE 2026-04 unverdicted novelty 6.0

    More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.

  26. Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

    cs.SE 2026-04 unverdicted novelty 6.0

    Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.

  27. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    cs.SE 2025-09 conditional novelty 6.0

    SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.

  28. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  29. KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

    cs.SE 2026-04 unverdicted novelty 5.0

    KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...

  30. More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

    cs.SE 2026-04 unverdicted novelty 5.0

    AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.

  31. Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

    cs.SE 2026-04 unverdicted novelty 5.0

    Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.

  32. Spec Kit Agents: Context-Grounded Agentic Workflows

    cs.SE 2026-04 unverdicted novelty 5.0

    A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.

  33. Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

    cs.AI 2026-04 conditional novelty 5.0

    A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.

  34. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  35. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  36. LLM-Based Automated Diagnosis Of Integration Test Failures At Google

    cs.SE 2026-04 unverdicted novelty 4.0

    Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.

  37. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

Reference graph

Works this paper leans on

115 extracted references · 115 canonical work pages · cited by 36 Pith papers · 6 internal anchors

  1. [1]

    Agent-101: A Software Engineering Agent for Code Assistance devel- oped by IBM Research

    2024. Agent-101: A Software Engineering Agent for Code Assistance devel- oped by IBM Research. https://github.com/swe-bench/experiments/blob/main/ evaluation/lite/20240612_IBM_Research_Agent101/README.md/

  2. [3]

    Alex SIMA

    2024. Alex SIMA. https://github.com/swe-bench/experiments/tree/main/ evaluation/lite/20240706_sima_gpt4o

  3. [4]

    Amazon Q Developer The most capable generative AI–powered assistant for software development

    2024. Amazon Q Developer The most capable generative AI–powered assistant for software development. https://aws.amazon.com/q/developer//

  4. [5]

    AppMap speedruns to the top of the SWE Bench Leaderboard.https://appmap

    2024. AppMap speedruns to the top of the SWE Bench Leaderboard.https://appmap. io/blog/2024/06/20/appmap-navie-swe-bench-leader/

  5. [6]

    AutoCodeRover Autonomous Software Engineering

    2024. AutoCodeRover Autonomous Software Engineering. https://autocoderover. dev/

  6. [7]

    Devin, AI software engineer

    2024. Devin, AI software engineer. https://www.cognition.ai/ introducing-devin

  7. [8]

    Empower your AI agents with Composio - a platform for managing and integrating tools with LLMs and AI agents using Function Calling

    2024. Empower your AI agents with Composio - a platform for managing and integrating tools with LLMs and AI agents using Function Calling. https://docs. composio.dev/introduction/intro/overview

  8. [9]

    Factory Bringing Autonomy to Software Engineering

    2024. Factory Bringing Autonomy to Software Engineering. https://www.factory. ai/

  9. [10]

    Honeycomb

    2024. Honeycomb. https://honeycomb.sh

  10. [11]

    Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

    2024. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https://www.anthropic.com/news/3-5-models-and-computer-use

  11. [12]

    2024. Isoform. https://github.com/swe-bench/experiments/tree/main/ evaluation/lite/20240829_Isoform

  12. [13]

    Lingma Agent

    2024. Lingma Agent. https://github.com/swe-bench/experiments/tree/main/ evaluation/lite/20240622_Lingma_Agent. 4Sadly the bike is currently broken. 19

  13. [14]

    MentatBot: New SOTA Coding Agent, Available Now

    2024. MentatBot: New SOTA Coding Agent, Available Now. https://mentat.ai/ blog/mentatbot-sota-coding-agent

  14. [15]

    Moatless Tools

    2024. Moatless Tools. https://github.com/aorwall/moatless-tools

  15. [16]

    OpenCSG StarShip

    2024. OpenCSG StarShip. https://opencsg.com/product?class=StarShip/

  16. [17]

    OpenDevin: Code Less, Make More

    2024. OpenDevin: Code Less, Make More. https://github.com/OpenDevin/ OpenDevin/

  17. [18]

    Python ast — Abstract Syntax Trees

    2024. Python ast — Abstract Syntax Trees. https://docs.python.org/3/library/ ast.html/

  18. [19]

    RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph

    2024. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. https://github.com/ozyyshr/RepoGraph

  19. [20]

    The Road to Ultimate Pull Request Machine

    2024. The Road to Ultimate Pull Request Machine. https://gru.ai/blog/ road-to-ultimate-pull-request-machine/

  20. [21]

    2024. Solver. https://solverai.com

  21. [22]

    SuperCoder

    2024. SuperCoder. https://superagi.com/supercoder/

  22. [23]

    SWE-bench Lite

    2024. SWE-bench Lite. https://www.swebench.com/lite.html

  23. [24]

    Rui Abreu, Peter Zoeteweij, Rob Golsteijn, and Arjan JC Van Gemund. 2009. A practical evaluation of spectrum-based fault localization.Journal of Systems and Software 82, 11 (2009), 1780–1792

  24. [25]

    Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and industrial conference practice and research techniques-MUTATION (TAICP ART-MUTATION 2007). IEEE, 89–98

  25. [26]

    Anthropic. 2024. Introducing Claude 3.5 Sonnet.https://www.anthropic.com/news/ claude-3-5-sonnet/

  26. [27]

    Daman Arora, Atharv Sonwane, Nalin Wadhwa, Abhav Mehrotra, Saiteja Utpala, Ramakrishna Bairi, Aditya Kanade, and Nagarajan Natarajan. 2024. MASAI: Modular Architecture for Software-engineering AI Agents. arXiv preprint arXiv:2406.11638 (2024)

  27. [28]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton

  28. [29]

    Program Synthesis with Large Language Models

    Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL]

  29. [30]

    Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2024. Repairagent: An autonomous, llm-based agent for program repair. arXiv preprint arXiv:2403.17134 (2024)

  30. [31]

    Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al. 2024. CodeR: Issue Resolving with Multi-Agent and Task Graphs. arXiv preprint arXiv:2406.01304 (2024)

  31. [32]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al . 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  32. [33]

    Yang Chen. 2024. Flakiness Repair in the Era of Large Language Models. InProceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 441–443

  33. [34]

    Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-No¨el Pouchet, Denys Poshy- vanyk, and Martin Monperrus. 2019. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transaction on Software Engineering(2019). 20

  34. [35]

    Jimenez, John Yang, Kevin Liu, and Aleksander Madry

    Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Car- los E. Jimenez, John Yang, Kevin Liu, and Aleksander Madry. 2024. Introduc- ing SWE-bench Verified. OpenAI Blog (2024). https://openai.com/index/ introducing-swe-bench-verified/

  35. [36]

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In 32nd International Symposium on Software Testing and Analysis (ISSTA)

  36. [37]

    Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2024. Large Language Models are Edge-Case Fuzzers: Testing Deep Learning Libraries via FuzzGPT. In 46th International Conference on Software Engineering (ICSE)

  37. [38]

    Paul Gauthier. 2024. Aider is AI pair programming in your terminal. https://aider. chat/

  38. [39]

    Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions on Software Engineering45, 1 (2019), 34–67

  39. [40]

    Ali Ghanbari, Samuel Benton, and Lingming Zhang. 2019. Practical Program Repair via Bytecode Mutation. InProceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis(Beijing, China) (ISSTA 2019). ACM, 19–30

  40. [41]

    D´avid Hidv´egi, Khashayar Etemadi, Sofia Bobadilla, and Martin Monperrus. 2024. Cigar: Cost-efficient program repair with llms. arXiv preprint arXiv:2402.06598 (2024)

  41. [42]

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self-Correct Reasoning Yet. InThe Twelfth International Conference on Learning Representations. https: //openreview.net/forum?id=IkmD3fKBPQ

  42. [43]

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of Code Language Models on Automated Program Repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1430–1442. https://doi.org/10.1109/ICSE48619.2023.00125

  43. [44]

    Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (May 2021)

  44. [45]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VTF8yNQM66

  45. [46]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench Leaderboard. https://www.swebench.com/

  46. [47]

    Wei Jin and Alessandro Orso. 2012. Bugredux: Reproducing field failures for in-house debugging. In 2012 34th international conference on software engineering (ICSE). IEEE, 474–484

  47. [48]

    James A Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering. 273–282

  48. [49]

    Sungmin Kang, Gabin An, and Shin Yoo. 2024. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1424–1446

  49. [50]

    Sophia D Kolak, Ruben Martins, Claire Le Goues, and Vincent Josua Hellendoorn. Patch Generation with Language Models: Feasibility and Scaling Behavior. In Deep Learning for Code Workshop

  51. [52]

    Xuan Bach D. Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. 213–224

  52. [53]

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering 38, 1 (2012), 54–72

  53. [54]

    Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. CODAMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In 45th International Conference on Software Engineering (ICSE)

  54. [55]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you!

  55. [56]

    Xia Li, Wei Li, Yuqun Zhang, and Lingming Zhang. 2019. Deepfl: Integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis. 169–180

  56. [57]

    Yi Li, Shaohua Wang, and Tien N. Nguyen. 2020. DLFix: Context-Based Code Transformation Learning for Automated Program Repair. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 602–614

  57. [58]

    Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2024. Large Language Model-Based Agents for Software Engineering: A Survey. arXiv preprint arXiv:2409.02977 (2024)

  58. [59]

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. TBar: Revisiting Template-Based Automated Program Repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2019). ACM, New York, NY, USA, 31–42

  59. [60]

    Yizhou Liu, Pengfei Gao, Xinchen Wang, Chao Peng, and Zhao Zhang. 2024. MarsCode Agent: AI-native Automated Bug Fixing. arXiv preprint arXiv:2409.00899 (2024)

  60. [61]

    Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  61. [62]

    LlamaIndex. 2024. LlamaIndex, Data Framework for LLM Applications. https://www.llamaindex.ai/

  62. [63]

    Fan Long and Martin Rinard. 2015. Staged Program Repair with Condition Synthesis. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (Bergamo, Italy) (ESEC/FSE 2015). New York, NY, USA, 166–178

  63. [64]

    Fan Long and Martin Rinard. 2016. An analysis of the search spaces for generate and validate patch generation systems. In Proceedings of the 38th International Conference on Software Engineering. 702–713

  64. [65]

    Yiling Lou, Ali Ghanbari, Xia Li, Lingming Zhang, Haotian Zhang, Dan Hao, and Lu Zhang. 2020. Can automated program repair refine fault localization? a unified debugging approach. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 75–87

  65. [66]

    Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. How to Understand Whole Software Repository? arXiv preprint arXiv:2406.01422 (2024)

  67. [68]

    Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable Multiline Program Patch Synthesis via Symbolic Analysis. In Proceedings of the 38th International Conference on Software Engineering (Austin, Texas) (ICSE ’16). 691–701

  68. [69]

    Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. 2024. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS)

  69. [70]

    Xiangxin Meng, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2022. Improving fault localization and program repair with deep semantic features and transferred knowledge. In Proceedings of the 44th International Conference on Software Engineering. 1169–1180

  70. [71]

    Seokhyeon Moon, Yunho Kim, Moonzoo Kim, and Shin Yoo. 2014. Ask the mutants: Mutating faulty programs for fault localization. In 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation. IEEE, 153–162

  71. [72]

    Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Is Self-Repair a Silver Bullet for Code Generation?. In The Twelfth International Conference on Learning Representations

  72. [73]

    Yaroslav Oliinyk, Michael Scott, Ryan Tsang, Chongzhou Fang, Houman Homayoun, et al. 2024. Fuzzing BusyBox: Leveraging LLM and Crash Reuse for Embedded Bug Unearthing. arXiv preprint arXiv:2403.03897 (2024)

  73. [74]

    OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023)

  74. [75]

    OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  75. [76]

    OpenAI. 2024. New embedding models and API updates. https://openai.com/index/new-embedding-models-and-api-updates/

  76. [77]

    OpenAI. 2024. OpenAI o1 System Card. https://openai.com/index/openai-o1-system-card/

  77. [78]

    Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2024. The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators. In ASPLOS

  78. [79]

    Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. Mutation testing advances: an analysis and survey. In Advances in computers. Vol. 112. Elsevier, 275–378

  80. [81]

    Mike Papadakis and Yves Le Traon. 2015. Metallaxis-FL: mutation-based fault localization. Software Testing, Verification and Reliability 25, 5-7 (2015), 605–628

Showing first 80 references.