SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Alexander Wettig, Carlos E. Jimenez, John Yang, Karthik Narasimhan, Kilian Lieret, Shunyu Yao · 2024 · cs.SE · arXiv 2405.15793

Canonical reference. 80% of citing Pith papers cite this work as background.

129 Pith papers citing it

abstract

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.

citation-role summary

background 22 · method 2 · baseline 1

citation-polarity summary

background 20 · use method 2 · baseline 1 · support 1 · unclear 1

representative citing papers

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

cs.SE · 2026-06-21 · unverdicted · novelty 7.0

RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

cs.LG · 2026-06-10 · conditional · novelty 7.0

Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.

Decentralized Multi-Agent Systems with Shared Context

cs.MA · 2026-06-09 · unverdicted · novelty 7.0

DeLM decentralizes LLM multi-agent coordination with shared verified context, delivering up to 10.5pp gains on SWE-bench Verified and 5.7pp on LongBench-v2 while cutting cost per task by ~50%.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

cs.SE · 2026-06-03 · conditional · novelty 7.0

A new six-dimension process taxonomy for AI software development frameworks shows convergence on artifact persistence and human oversight but reveals that no framework covers all dimensions strongly, indicating a depth-portability trade-off.

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

cs.SE · 2026-06-03 · unverdicted · novelty 7.0

DeployBench is a new benchmark of 51 research-artifact deployment tasks where four LLMs with OpenHands achieve 7.8-51% pass rates, with failures mostly from agents stopping after weaker self-checks than the paper requires.

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

PassNet provides a dataset of 18K graphs and PassBench for LLM-generated compiler passes, with fine-tuned models achieving 2.67x gains on long-tail tasks where TorchInductor underperforms.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

cs.AI · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

MBABench evaluates LLM agents on end-to-end financial spreadsheet tasks and shows current models fail to meet professional finance standards, especially beyond simple calculations.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

Constrained Code Generation with Discrete Diffusion

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Constrained Diffusion for Code (CDC) integrates constraint satisfaction into the reverse denoising process of discrete diffusion models via constraint-aware operators that use optimization and program analysis to steer generation toward feasible programs.

BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

cs.SE · 2026-05-15 · unverdicted · novelty 7.0

BootstrapAgent distills repository bootstrapping heuristics into a persistent .bootstrap contract via multi-agent evidence extraction, Docker verification, and trace-driven repair, reporting 92.9% success and efficiency gains on three benchmarks.

Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

Neo combines LLM-based agents with code search primitives to detect privilege escalation in polyglot microservices, reporting 81% precision and 85% recall while uncovering 24 zero-day vulnerabilities across 25 applications.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

cs.CL · 2026-05-12 · conditional · novelty 7.0 · 2 refs

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

CrackMeBench: Binary Reverse Engineering for Agents

cs.SE · 2026-05-11 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

citing papers explorer

Showing 29 of 129 citing papers.

KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant

cs.SE · 2026-04-26 · unverdicted · none · ref 25 · 2 links

The paper introduces KISS Sorcar, a simple open-source AI agent framework with a five-layer hierarchy and git worktree isolation to address context limits, error propagation, and reviewability in software engineering tasks.

Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows

cs.SE · 2026-04-20 · unverdicted · none · ref 22

Large-scale analysis of AI bot PRs shows Copilot and Codex achieve the highest CI/CD success rates but more frequent AI contributions correlate with reduced workflow reliability.

Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

cs.AI · 2026-04-13 · unverdicted · none · ref 26

Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

cs.SE · 2026-04-09 · accept · none · ref 172

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring

cs.AI · 2026-04-07 · unverdicted · none · ref 9

Deep Researcher Agent is a framework for autonomous 24/7 deep learning experimentation by LLM agents using zero-cost monitoring, constant-size memory, and a minimal-toolset multi-agent design.

Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol

cs.DC · 2026-03-13 · unverdicted · none · ref 18

An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

cs.AI · 2025-09-02 · conditional · none · ref 79

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

cs.AI · 2024-11-07 · unverdicted · none · ref 66

Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.

ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair

cs.AI · 2026-07-02 · unverdicted · none · ref 2

ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite with 2 percentage point drops in resolution rate.

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

cs.SE · 2026-06-16 · unverdicted · none · ref 49

Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.

The End of Code Review: Coding Agents Supersede Human Inspection

cs.SE · 2026-06-11 · unverdicted · none · ref 6

Coding agents have reached a capability level where human code review is no longer necessary because agents can serve every review goal more efficiently and the hybrid human-reviewer model does not scale.

Exploration Structure in LLM Agents for Multi-File Change Localization

cs.SE · 2026-06-10 · unverdicted · none · ref 6

Non-linear domain-scoped parallel LLM agents achieve higher micro F1 than linear exploration and some baselines for multi-file change localization on SWE-bench Pro ansible tasks.

What makes a harness a harness: necessary and sufficient conditions for an agent harness

cs.SE · 2026-06-08 · unverdicted · none · ref 58

Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

cs.AI · 2026-06-08 · unverdicted · none · ref 21

Proposes four operational criteria for MetaAI recursive self-design, maps public systems including DGM's reported benchmark gains, and supplies a reproducible protocol without completed experimental runs.

CLI-Anything: Towards Agent-Native Computer Use

cs.HC · 2026-06-02 · unverdicted · none · ref 13

CLI-Anything advocates transforming applications into CLI-based protocols for agent-native interaction and introduces the CLI-Hub platform to support this shift away from GUI agents.

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

cs.AI · 2026-05-04 · unverdicted · none · ref 4

A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

cs.LG · 2026-05-01 · unverdicted · none · ref 29

AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.

Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures

cs.SE · 2026-04-15 · unverdicted · none · ref 70

Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.

OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains

cs.AI · 2026-04-07 · unverdicted · none · ref 19

OpenKedge redefines AI agent state mutations as a governed process using intent proposals, policy-evaluated execution contracts, and cryptographic evidence chains to enable safe, auditable agentic behavior.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · none · ref 118

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning

cs.LG · 2026-07-01 · unverdicted · none · ref 16

Comparative study of four execution substrates for coding-agent RL rollouts finds 110x cold-start latency variation and 1.8x spread in worker-hours for one million 150-step trajectories.

The Rise of AI-Native Software Engineering: Implications for Practice, Education, and the Future Workforce

cs.SE · 2026-06-11 · unverdicted · none · ref 4

A systematic review of 48 papers on AI in software engineering synthesizes evidence into frameworks for AI-native practice, a nine-dimension competency model, a four-phase curriculum roadmap, and an agenda of research gaps, while noting contradictory productivity findings.

Agent System Operations: Categorization, Challenges, and Future Directions

cs.MA · 2026-06-01 · unverdicted · none · ref 111

This survey categorizes anomalies in agent systems into intra-agent and inter-agent types and introduces the AgentOps framework with four operational stages.

Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report

cs.SE · 2026-05-25 · unverdicted · none · ref 9

Presents a contract-driven adversarial verification architecture for AI-native software production with early deployment observations from 17 features.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

cs.IR · 2026-05-08 · unverdicted · none · ref 52 · 3 links

A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.

Challenges and Future Directions in Agentic Reverse Engineering Systems

cs.CR · 2026-04-15 · unverdicted · none · ref 16

Agentic LLM systems for reverse engineering fail on obfuscation, timing, and unique architectures due to token limits and missing guardrails, with challenges and directions proposed.

Building an Internal Coding Agent at Zup: Lessons and Open Questions

cs.SE · 2026-04-10 · unverdicted · none · ref 13

Engineering choices for tools, safety guardrails, and human oversight determine whether an internal coding agent delivers value in practice more than the underlying model quality.

VeRO: A Harness for Agents to Optimize Agents

cs.AI · 2026-02-25 · unreviewed · ref 28

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

cs.SE · 2025-12-21 · unreviewed · ref 52
