

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Canonical reference. 80% of citing Pith papers cite this work as background.

142 Pith papers citing it
Background 80% of classified citations
abstract

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.


citation-role summary

background 22 · method 2 · baseline 1


representative citing papers

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

Understanding the (In)Security of Vibe-Coded Applications

cs.CR · 2026-06-22 · unverdicted · novelty 7.0

Empirical study of real-world vibe-coded apps finds recurring vulnerabilities like placeholder logic and secret exposure caused by AI agent limitations such as memory loss and insufficient security knowledge.

Decentralized Multi-Agent Systems with Shared Context

cs.MA · 2026-06-09 · unverdicted · novelty 7.0

DeLM decentralizes LLM multi-agent coordination with shared verified context, delivering up to 10.5pp gains on SWE-bench Verified and 5.7pp on LongBench-v2 while cutting cost per task by ~50%.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

Constrained Code Generation with Discrete Diffusion

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to steer generation toward feasible programs.

BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

cs.SE · 2026-05-15 · unverdicted · novelty 7.0

BootstrapAgent distills repository bootstrapping heuristics into a persistent .bootstrap contract via multi-agent evidence extraction, Docker verification, and trace-driven repair, reporting 92.9% success and efficiency gains on three benchmarks.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

citing papers explorer

Showing 9 of 9 citing papers after filters.

  • ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation · cs.CR · 2025-07-14 · unverdicted · none · ref 54 · internal anchor

    ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

  • Understanding the (In)Security of Vibe-Coded Applications · cs.CR · 2026-06-22 · unverdicted · none · ref 18 · internal anchor

    Empirical study of real-world vibe-coded apps finds recurring vulnerabilities like placeholder logic and secret exposure caused by AI agent limitations such as memory loss and insufficient security knowledge.

  • Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis · cs.CR · 2026-05-15 · unverdicted · none · ref 24 · internal anchor

    Neo combines LLM-based agents with code search primitives to detect privilege escalation in polyglot microservices, reporting 81% precision and 85% recall while uncovering 24 zero-day vulnerabilities across 25 applications.

  • Agentic Vulnerability Reasoning on Windows COM Binaries · cs.CR · 2026-05-06 · accept · none · ref 57 · internal anchor

    SLYP agentic pipeline discovers race condition vulnerabilities in Windows COM binaries and generates debugger-verified PoCs, scoring 0.973 F1 on a 40-case benchmark and finding 28 new confirmed vulnerabilities in production services.

  • Prompt Injection Attack to Tool Selection in LLM Agents · cs.CR · 2025-04-28 · conditional · none · ref 3 · internal anchor

    ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

  • VeriPort: Automated and Verified Patch Backporting at Scale · cs.CR · 2026-06-21 · unverdicted · none · ref 16 · internal anchor

    VeriPort is an end-to-end agentic system that backports vulnerability patches to all affected versions of a package at scale while producing verification evidence, achieving 95.3% success on 128 benchmark tasks and generating over 5,000 verified patches across 169 CVEs.

  • Whose Agent Are You? Multi-Layer Fingerprinting and Attribution of Autonomous Web Agents · cs.CR · 2026-06-18 · unverdicted · none · ref 36 · internal anchor

    Multi-layer fingerprinting using TLS, HTTP, and browser behavior identifies distinct AI web agent frameworks at 97% accuracy via decision tree classification.

  • SoK: Agentic Skills -- Beyond Tool Use in LLM Agents · cs.CR · 2026-02-24 · unverdicted · none · ref 2 · internal anchor

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  • Challenges and Future Directions in Agentic Reverse Engineering Systems · cs.CR · 2026-04-15 · unverdicted · none · ref 16 · internal anchor

    Agentic LLM systems for reverse engineering fail on obfuscation, timing, and unique architectures due to token limits and missing guardrails, with challenges and directions proposed.