arxiv: 2405.15793 · v3 · submitted 2024-05-06 · 💻 cs.SE · cs.AI · cs.CL · cs.HC · cs.LG

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Alexander Wettig, Carlos E. Jimenez, John Yang, Karthik Narasimhan, Kilian Lieret, Ofir Press, Shunyu Yao

Pith reviewed 2026-05-11 12:41 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.HC · cs.LG

keywords agent-computer interface · language model agents · software engineering · SWE-bench · code editing · repository navigation · autonomous agents

The pith

A custom interface lets language model agents autonomously edit code, navigate repositories, and run tests to solve software engineering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper posits that language model agents need specially designed tools to handle complex digital work the way humans use integrated development environments. It presents SWE-agent as a system built around an agent-computer interface that gives agents direct commands to create files, edit code, browse codebases, and execute programs. This setup produces measurable gains over agents that lack interactive access to the computer environment. The authors evaluate the approach on established benchmarks and report higher success rates than prior non-interactive language models. They also examine how specific interface choices shape the agents' behavior during task completion.

Core claim

SWE-agent's custom agent-computer interface significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs, achieving state-of-the-art pass@1 rates of 12.5 percent on SWE-bench and 87.7 percent on HumanEvalFix.
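
The pass@1 figures quoted above can be read as the fraction of benchmark tasks solved in a single attempt. As a minimal sketch, the block below implements the widely used unbiased pass@k estimator (n samples per task, c of them correct), which reduces to c/n when k = 1. Whether the paper computes its numbers with this exact estimator is an assumption here, and the function names are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass the tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(resolved: list[bool]) -> float:
    """Single-attempt pass rate: fraction of tasks resolved (hypothetical helper)."""
    if not resolved:
        raise ValueError("need at least one task")
    return sum(resolved) / len(resolved)
```

With one sample per task, `pass_at_k(1, c, 1)` is simply 1.0 or 0.0, and averaging over tasks gives the same value as `benchmark_pass_at_1`.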

What carries the argument

The agent-computer interface, a collection of commands and tools that allow the language model to interact directly with the file system, editor, and shell like a developer would.
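
To make the idea of "a collection of commands" concrete, here is a toy, in-memory sketch of what such an interface could look like. It is not SWE-agent's actual command set: the class name `ToyACI`, its methods, and the observation format are all hypothetical. The point it illustrates is that each command returns a short text observation the model can read, rather than raw shell output.

```python
from dataclasses import dataclass, field


@dataclass
class ToyACI:
    """Hypothetical stand-in for an agent-computer interface (in-memory files only)."""
    files: dict[str, list[str]] = field(default_factory=dict)
    window: int = 5  # number of lines shown per observation (assumed value)

    def create(self, path: str) -> str:
        if path in self.files:
            return f"error: {path} already exists"
        self.files[path] = []
        return f"created {path}"

    def open_file(self, path: str, line: int = 1) -> str:
        """Show a numbered window of the file starting at `line` (1-indexed)."""
        if path not in self.files:
            return f"error: {path} not found"
        lines = self.files[path]
        start = max(line - 1, 0)
        shown = lines[start:start + self.window]
        body = "\n".join(f"{start + i + 1}: {text}" for i, text in enumerate(shown))
        return f"[{path}: {len(lines)} lines total]\n{body}"

    def edit(self, path: str, start: int, end: int, replacement: str) -> str:
        """Replace lines start..end (1-indexed, inclusive); end = start - 1 inserts."""
        if path not in self.files:
            return f"error: {path} not found"
        lines = self.files[path]
        if not (1 <= start <= end + 1 <= len(lines) + 1):
            return "error: line range out of bounds"
        lines[start - 1:end] = replacement.splitlines()
        # Echo the edited region back so the agent sees the effect immediately.
        return self.open_file(path, start)

    def search(self, term: str) -> str:
        """List path:line for every line containing `term`."""
        hits = [
            f"{path}:{i}"
            for path, lines in sorted(self.files.items())
            for i, text in enumerate(lines, 1)
            if term in text
        ]
        return "\n".join(hits) or "no matches"
```

A session might call `create("a.py")`, then `edit("a.py", 1, 0, "x = 1")` to insert a first line, and read the returned window as the next observation.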

Load-bearing premise

The observed performance gains stem primarily from the interface design rather than from the particular language model, prompt wording, or other implementation details.

What would settle it

Re-evaluate the same base language model on SWE-bench and HumanEvalFix using only a standard non-interactive text interface and compare the resulting pass rates to the reported 12.5 percent and 87.7 percent figures.

Original abstract

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SWE-agent, an LM-based agent system featuring a custom agent-computer interface (ACI) with specialized commands for file creation/editing, repository navigation, and test/program execution. The central claim is that this ACI design substantially improves agent performance on software engineering tasks, enabling state-of-the-art pass@1 results of 12.5% on SWE-bench and 87.7% on HumanEvalFix—far above prior non-interactive LM approaches—and provides qualitative insights into ACI design effects on agent behavior.

Significance. If the performance improvements can be causally attributed to the ACI rather than confounding factors, the work would provide concrete evidence that interface design tailored to LM agents' interaction patterns can unlock substantial gains in complex, stateful domains like software repositories. The empirical results on two established benchmarks offer a clear, falsifiable demonstration and could inform future agent architectures for automated software engineering.

major comments (2)

[Evaluation / Experiments section] The central claim that the custom ACI is the primary driver of the reported SOTA gains (12.5% pass@1 on SWE-bench) rests on an empirical comparison to prior non-interactive LMs, but the evaluation lacks controlled ablations that hold the underlying LM, base prompting strategy, and interaction loop fixed while varying only the ACI (e.g., against a generic ReAct loop or standard function-calling tools). Without such isolation, alternative explanations—model selection, prompt details, or the mere presence of interactivity—cannot be ruled out.
[Evaluation / Experiments section] The comparison to 'previous state-of-the-art achieved with non-interactive LMs' does not specify how those baselines were re-implemented or given equivalent access to tools and environment state; if the baselines were strictly non-interactive (no tool use at all), the performance delta may overstate the ACI's unique contribution relative to any interactive setup.
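
The controlled ablation requested in the major comments can be sketched as a paired comparison: run the same language model and prompt on the same task ids under two interface conditions, then compare per-task outcomes. The code below is one possible analysis, assuming an exact McNemar test on discordant pairs; the condition names and the outcome data are made up for illustration and are not results from the paper.

```python
from math import comb


def resolve_rate(outcomes: dict[str, bool]) -> float:
    """Fraction of tasks resolved under one condition."""
    return sum(outcomes.values()) / len(outcomes)


def mcnemar_exact(a: dict[str, bool], b: dict[str, bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on paired task outcomes.

    Returns (tasks only `a` resolved, tasks only `b` resolved, p-value).
    """
    if a.keys() != b.keys():
        raise ValueError("both conditions must cover the same task ids")
    only_a = sum(a[t] and not b[t] for t in a)
    only_b = sum(b[t] and not a[t] for t in a)
    n = only_a + only_b
    if n == 0:
        return 0, 0, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null, discordant pairs split 50/50: binomial tail probability.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Made-up outcomes for eight hypothetical tasks (same LM and prompt in both runs).
with_aci = {f"t{i}": i < 6 for i in range(8)}       # resolves t0..t5
generic_loop = {f"t{i}": i == 5 for i in range(8)}  # resolves t5 only

only_aci, only_generic, p_value = mcnemar_exact(with_aci, generic_loop)
```

Pairing by task id matters because benchmark tasks vary widely in difficulty; comparing only aggregate rates would discard that information.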

minor comments (2)

[Abstract / Introduction] The abstract and introduction would benefit from a brief, explicit statement of the ACI's command set (e.g., the exact syntax for file editing and navigation) to allow readers to assess novelty without immediately consulting the full system description.
[Figures] Figure captions and the ACI diagram could be expanded to label each specialized command and its effect on the agent's observation space, improving clarity for readers unfamiliar with the interface.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. We address each major comment below and will revise the manuscript to improve clarity on baselines and experimental design.

Referee: [Evaluation / Experiments section] The central claim that the custom ACI is the primary driver of the reported SOTA gains (12.5% pass@1 on SWE-bench) rests on an empirical comparison to prior non-interactive LMs, but the evaluation lacks controlled ablations that hold the underlying LM, base prompting strategy, and interaction loop fixed while varying only the ACI (e.g., against a generic ReAct loop or standard function-calling tools). Without such isolation, alternative explanations—model selection, prompt details, or the mere presence of interactivity—cannot be ruled out.

Authors: We agree that additional controlled ablations would provide stronger causal evidence for the ACI's contribution. Our current evaluation follows standard practice by comparing against published non-interactive baselines on established benchmarks. The ACI's specialized commands for repository navigation, file editing, and execution are designed specifically for LM agents and differ from generic ReAct or function-calling setups. In revision, we will add a dedicated limitations and discussion subsection that explicitly addresses potential confounders such as model choice and interactivity, along with qualitative examples from agent trajectories illustrating ACI-specific behaviors. We will also include a new comparison to a ReAct baseline using the same LM where feasible.

revision: partial

Referee: [Evaluation / Experiments section] The comparison to 'previous state-of-the-art achieved with non-interactive LMs' does not specify how those baselines were re-implemented or given equivalent access to tools and environment state; if the baselines were strictly non-interactive (no tool use at all), the performance delta may overstate the ACI's unique contribution relative to any interactive setup.

Authors: The prior SOTA results are taken directly from the original SWE-bench and HumanEvalFix papers, which used strictly non-interactive LM prompting without any tool access or environment interaction. We did not re-implement them, as is conventional when citing benchmark leaderboards. We will revise the relevant sections to explicitly state this and clarify that our ACI provides a tailored interactive interface with stateful commands unavailable in non-interactive setups. This distinction supports our core claim about interface design for LM agents. While comparisons to other interactive methods would be informative, they fall outside the paper's primary focus on ACI innovation.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical system introduction and benchmark evaluation

The paper introduces SWE-agent as an empirical system with a custom ACI and reports pass@1 rates on SWE-bench (12.5%) and HumanEvalFix (87.7%). No derivation chain, equations, first-principles predictions, or uniqueness theorems appear in the provided abstract or claimed structure. Performance claims rest on direct benchmark measurements rather than any reduction to fitted parameters, self-definitions, or self-citation load-bearing steps. The absence of ablations noted in the referee report is an experimental-design concern, not a circularity issue under the specified patterns. The work is self-contained as an engineering demonstration against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that LM agents benefit from human-like specialized interfaces and on the empirical observation that the described ACI produces the reported gains; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: Language model agents represent a new category of end users that would benefit from specially-built interfaces to the software they use.
Stated in the opening of the abstract as the motivating premise.

invented entities (1)

Agent-Computer Interface (ACI) · no independent evidence
purpose: Custom interface enabling LM agents to create/edit files, navigate repositories, and execute programs for software engineering tasks.
New design artifact introduced by the paper; no independent evidence outside the system itself is provided in the abstract.

pith-pipeline@v0.9.0 · 5525 in / 1332 out tokens · 54095 ms · 2026-05-11T12:41:06.220716+00:00 · methodology

Forward citations

Cited by 56 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
cs.CL 2026-05 unverdicted novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
cs.AI 2026-05 unverdicted novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
physics.chem-ph 2026-04 conditional novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...
Harnessing Agentic Evolution
cs.AI 2026-05 unverdicted novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 unverdicted novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 conditional novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
CrackMeBench: Binary Reverse Engineering for Agents
cs.SE 2026-05 accept novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
cs.AI 2026-05 unverdicted novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 7.0

BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
cs.SE 2026-05 unverdicted novelty 7.0

PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
cs.AI 2026-05 unverdicted novelty 7.0

Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
Agentic Vulnerability Reasoning on Windows COM Binaries
cs.CR 2026-05 accept novelty 7.0

SLYP agentic pipeline discovers race condition vulnerabilities in Windows COM binaries and generates debugger-verified PoCs, scoring 0.973 F1 on a 40-case benchmark and finding 28 new confirmed vulnerabilities in prod...
ProgramBench: Can Language Models Rebuild Programs From Scratch?
cs.SE 2026-05 unverdicted novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
cs.SE 2026-04 unverdicted novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
cs.SE 2026-04 unverdicted novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
cs.SE 2026-04 unverdicted novelty 7.0

Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
cs.AI 2026-04 unverdicted novelty 7.0

OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
cs.AI 2026-04 unverdicted novelty 7.0

HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
cs.SE 2026-04 unverdicted novelty 7.0

SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.
ABTest: Behavior-Driven Testing for AI Coding Agents
cs.SE 2026-04 unverdicted novelty 7.0

ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
cs.SE 2026-05 unverdicted novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
Rollout Cards: A Reproducibility Standard for Agent Research
cs.AI 2026-05 conditional novelty 6.0

Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
cs.SE 2026-05 unverdicted novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Evaluation-driven Scaling for Scientific Discovery
cs.LG 2026-04 unverdicted novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
OpenGame: Open Agentic Coding for Games
cs.SE 2026-04 unverdicted novelty 6.0

OpenGame is the first open-source agentic framework for end-to-end web game creation, using Game Skills and GameCoder-27B to achieve state-of-the-art results on 150 prompts via a new benchmark measuring build health, ...
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
cs.LG 2026-04 unverdicted novelty 6.0

AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
cs.AI 2026-04 unverdicted novelty 6.0

LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
cs.DC 2026-04 unverdicted novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python
cs.SE 2026-04 unverdicted novelty 6.0

LLM-driven translation of a production Rust AI agent to Python achieves near-parity on SWE-bench (73.8% vs 70.0%) and Terminal-Bench (42.5% vs 47.5%) while evolving into a 15.9x smaller superset with 30 new capabilities.
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
Auditable Agents
cs.AI 2026-04 unverdicted novelty 6.0

No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
cs.CL 2026-04 unverdicted novelty 6.0

A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
cs.CR 2026-02 unverdicted novelty 6.0

The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
Agentless: Demystifying LLM-based Software Engineering Agents
cs.SE 2024-07 conditional novelty 6.0

Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
Discovery of Interpretable Surrogates via Agentic AI: Application to Gravitational Waves
gr-qc 2026-05 unverdicted novelty 5.0

GWAgent agentic workflow produces analytic surrogates for eccentric BBH waveforms with 6.9e-4 median mismatch and 8.4x speedup, outperforming baselines, and infers eccentricity for GW200129.
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
cs.AI 2026-05 unverdicted novelty 5.0

Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents
cs.AI 2026-04 unverdicted novelty 5.0

Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes...
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
cs.SE 2026-04 unverdicted novelty 5.0

KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...
Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows
cs.SE 2026-04 unverdicted novelty 5.0

Large-scale analysis of AI bot PRs shows Copilot and Codex achieve the highest CI/CD success rates but more frequent AI contributions correlate with reduced workflow reliability.
Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks
cs.AI 2026-04 unverdicted novelty 5.0

Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE 2026-04 accept novelty 5.0

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring
cs.AI 2026-04 unverdicted novelty 5.0

Deep Researcher Agent is a framework for autonomous 24/7 deep learning experimentation by LLM agents using zero-cost monitoring, constant-size memory, and a minimal-toolset multi-agent design.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
cs.SE 2026-05 conditional novelty 4.0

Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
cs.AI 2026-05 unverdicted novelty 4.0

A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.
Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures
cs.SE 2026-04 unverdicted novelty 4.0

Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.
OpenKedge: Governing Agentic Mutation with Execution-Bound Safety and Evidence Chains
cs.AI 2026-04 unverdicted novelty 4.0

OpenKedge redefines AI agent state mutations as a governed process using intent proposals, policy-evaluated execution contracts, and cryptographic evidence chains to enable safe, auditable agentic behavior.
Challenges and Future Directions in Agentic Reverse Engineering Systems
cs.CR 2026-04 unverdicted novelty 3.0

Agentic LLM systems for reverse engineering fail on obfuscation, timing, and unique architectures due to token limits and missing guardrails, with challenges and directions proposed.
Building an Internal Coding Agent at Zup: Lessons and Open Questions
cs.SE 2026-04 unverdicted novelty 3.0

Engineering choices for tools, safety guardrails, and human oversight determine whether an internal coding agent delivers value in practice more than the underlying model quality.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 53 Pith papers · 2 internal anchors

[1]

Austin, A

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021

work page 2021
[2]

J. M. Carroll. Human-computer interaction: psychology as a science of design. Annual review of psychology, 48(1):61–83, 1997

work page 1997
[3]

Cassano, J

F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y . Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022

work page 2022
[4]

Chakraborty, Y

S. Chakraborty, Y . Li, M. Irvine, R. Saha, and B. Ray. Entropy guided spectrum based bug localization using statistical language model. arXiv preprint arXiv:1802.06947, 2018

work page arXiv 2018
[5]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, and J. K. et. al. Evaluating large language models trained on code, 2021

work page 2021
[6]

Chiang, L

W.-L. Chiang, L. Zheng, Y . Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024

work page 2024
[7]

Cooper, R

A. Cooper, R. Reimann, and D. Cronin. About face 3: the essentials of interaction design. John Wiley & Sons, Inc., USA, 2007. ISBN 9780470084113

work page 2007
[8]

Y . Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K. Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum? id=wgDcbBMSfh

work page 2023
[9]

X. Du, M. Liu, K. Wang, H. Wang, J. Liu, Y . Chen, J. Feng, C. Sha, X. Peng, and Y . Lou. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation, 2023

work page 2023
[10]

Z. Fan, X. Gao, M. Mirchev, A. Roychoudhury, and S. H. Tan. Automated repair of programs from large language models, 2023

work page 2023
[11]

R. Fang, R. Bindu, A. Gupta, Q. Zhan, and D. Kang. Llm agents can autonomously hack websites, 2024

work page 2024
[12]

T. L. Griffiths. Understanding human intelligence through human limitations. Trends in Cognitive Sciences, 24(11):873–883, 2020

work page 2020
[13]

Y. Gu, Y. Shu, H. Yu, X. Liu, Y. Dong, J. Tang, J. Srinivasa, H. Latapie, and Y. Su. Middleware for llms: Tools are instrumental for language agents in complex environments, 2024
[14]

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence. CoRR, abs/2401.14196, 2024. URL https://arxiv.org/abs/2401.14196
[15]

D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps, 2021
[16]

S. Holt, M. R. Luyten, and M. van der Schaar. L2MAC: Large language model automatic computer for unbounded code generation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=EhrzQwsV4K

[17]

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2023
[18]

Q. Huang, J. Vora, P. Liang, and J. Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation, 2024
[19]

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

[20]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66
[21]

R. Just, D. Jalali, and M. D. Ernst. Defects4J: A Database of existing faults to enable controlled testing studies for Java programs. In ISSTA 2014, Proceedings of the 2014 International Symposium on Software Testing and Analysis, pages 437–440, San Jose, CA, USA, July 2014. Tool demo

[22]

S. Kang, J. Yoon, and S. Yoo. Large language models are few-shot testers: Exploring llm-based general bug reproduction, 2023

[23]

R.-M. Karampatsis and C. Sutton. How often do single-statement bugs occur? the manysstubs4j dataset. 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR), pages 573–577, 2019. URL https://api.semanticscholar.org/CorpusID:173188438
[24]

J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024
[25]

Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, S. W.-t. Yih, D. Fried, S. Wang, and T. Yu. Ds-1000: A natural and reliable benchmark for data science code generation, 2022
[26]

J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023
[27]

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts, 2023

[28]

T. Liu, C. Xu, and J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=pPjZIOuQuF
[30]

Y. Liu, X. Tang, Z. Cai, J. Lu, Y. Zhang, Y. Shao, Z. Deng, H. Hu, K. An, R. Huang, S. Si, S. Chen, H. Zhao, L. Chen, Y. Wang, T. Liu, Z. Jiang, B. Chang, Y. Qin, W. Zhou, Y. Zhao, A. Cohan, and M. Gerstein. Ml-bench: Evaluating large language models for code generation in repository-level machine learning tasks, 2024
[31]

S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation, 2021
[32]

R. T. McCoy, S. Yao, D. Friedman, M. Hardy, and T. L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023

[33]

N. Muennighoff, Q. Liu, A. R. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. V. Werra, and S. Longpre. Octopack: Instruction tuning code large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mw1PWNSWZP
[34]

R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022
[35]

OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brun...
[36]

C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Memgpt: Towards llms as operating systems, 2024
[37]

O. Press, M. Zhang, S. Min, L. Schmidt, N. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.find...
[38]

URL https://aclanthology.org/2023.findings-emnlp.378

[39]

M. Shao, B. Chen, S. Jancheska, B. Dolan-Gavitt, S. Garg, R. Karri, and M. Shafique. An empirical evaluation of llms for solving offensive security challenges, 2024
[40]

W. Shi, R. Xu, Y. Zhuang, Y. Yu, J. Zhang, H. Wu, Y. Zhu, J. Ho, C. Yang, and M. D. Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records, 2024
[41]

N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023
[42]

D. Sobania, M. Briesch, C. Hanna, and J. Petke. An analysis of the automatic bug fixing performance of chatgpt, 2023
[43]

A. Sridhar, R. Lo, F. F. Xu, H. Zhu, and S. Zhou. Hierarchical prompting assists large language model on web navigation, 2023
[44]

T. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents, 2023
[45]

X. Tang, A. Zou, Z. Zhang, Z. Li, Y. Zhao, X. Zhang, A. Cohan, and M. Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning, 2024
[46]

A. Thakur, G. Tsoukalas, Y. Wen, J. Xin, and S. Chaudhuri. An in-context learning agent for formal theorem-proving, 2024
[47]

R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Mei...
[48]

J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang. Software testing with large language model: Survey, landscape, and vision, 2023
[49]

L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, and J. Wen. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), Mar. 2024. ISSN 2095-2236. doi: 10.1007/s11704-024-40231-1. URL http://dx.doi.org/10.1007/s11704-024-40231-1
[50]

X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji. Executable code actions elicit better llm agents, 2024
[51]

Z. Wang, G. Cuenca, S. Zhou, F. F. Xu, and G. Neubig. Mconala: A benchmark for code generation from multiple natural languages, 2023

[52]

Z. Wang, S. Zhou, D. Fried, and G. Neubig. Execution-based evaluation for open-domain code generation, 2023

[53]

Z. Wang, D. Fried, and G. Neubig. Trove: Inducing verifiable and efficient toolboxes for solving programmatic tasks, 2024

[54]

M. Wornow, A. Narayan, K. Opsahl-Ong, Q. McIntyre, N. H. Shah, and C. Re. Automating the enterprise with foundation models, 2024
[55]

Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024

[56]

Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui. The rise and potential of large language model based agents: A survey, 2023
[57]

C. S. Xia and L. Zhang. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 959–971, 2022

[58]

C. S. Xia, M. Paltenghi, J. L. Tian, M. Pradel, and L. Zhang. Universal fuzzing via large language models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2023

[59]

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024
[60]

A. Z. H. Yang, C. Le Goues, R. Martins, and V. Hellendoorn. Large language models for test-free fault localization. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400702174. doi: 10.1145/3597503.3623342. URL https://doi.org/10.1145/35...
[61]

J. Yang, A. Prabhakar, K. R. Narasimhan, and S. Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=fvKaLF1ns8
[62]

J. Yang, A. Prabhakar, S. Yao, K. Pei, and K. R. Narasimhan. Language agents as hackers: Evaluating cybersecurity skills with capture the flag. In Multi-Agent Security Workshop@ NeurIPS’23, 2023

[63]

S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

[64]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X
[65]

P. Yin, W.-D. Li, K. Xiao, A. Rao, Y. Wen, K. Shi, J. Howland, P. Bailey, M. Catasta, H. Michalewski, A. Polozov, and C. Sutton. Natural language to code generation in interactive data science notebooks, 2022
[66]

H. Yu, B. Shen, D. Ran, J. Zhang, Q. Zhang, Y. Ma, G. Liang, Y. Li, T. Xie, and Q. Wang. Codereval: A benchmark of pragmatic code generation with generative pre-trained models. In International Conference on Software Engineering, 2023. URL https://api.semanticscholar.org/CorpusID:256459413
[67]

Z. Yu, X. Zhang, N. Shang, Y. Huang, C. Xu, Y. Zhao, W. Hu, and Q. Yin. Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint arXiv:2312.14187, 2023
[68]

E. Zelikman, Q. Huang, G. Poesia, N. D. Goodman, and N. Haber. Parsel: Algorithmic reasoning with language models by composing decompositions, 2022. URL https://arxiv.org/abs/2212.10561
[69]

E. Zelikman, E. Lorch, L. Mackey, and A. T. Kalai. Self-taught optimizer (stop): Recursively self-improving code generation, 2024
[70]

F. Zhang, B. Chen, Y. Zhang, J. Keung, J. Liu, D. Zan, Y. Mao, J.-G. Lou, and W. Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=q09vTY1Cqh
[71]

S. Zhang, J. Zhang, J. Liu, L. Song, C. Wang, R. Krishna, and Q. Wu. Training language model agents without modifying language models, 2024
[72]

A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent tree search unifies reasoning acting and planning in language models, 2023
[73]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig. Webarena: A realistic web environment for building autonomous agents, 2023

Appendix

In the appendix, we provide additional analyses and more extensive discussions about SWE-agent, agent-computer interface (ACI) design, and model performance on ...
[74]

Localization: Identify file(s)/line(s) causing the issue

[75]

Editing: Generate fixes addressing the given issue

[76]

Testing: Write new scripts or modify existing test files to reproduce the issue and/or verify if fixes are correct.

To enable LM-based agents to efficiently carry out these individual functions and progress towards the overarching goal of resolving a codebase issue, we provide a file viewer, file editor, search / navigation system, and context managem...
[77]

Correct your edit code. DO NOT re-run the same failed edit command. Running it again will lead to the same error.

Figure 11: A linting error message. This is emitted if a model generates an edit command that introduces a syntax error into the codebase. The error message shows the before and after of the proposed edit along with what error messages wer...
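The linting guard described in the Figure 11 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not SWE-agent's implementation: `guarded_edit` and `EditResult` are hypothetical names, the syntax check uses Python's built-in `ast.parse` as a stand-in for whatever linter the system runs, and only the two warning sentences are taken from the message shown above.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass


@dataclass
class EditResult:
    applied: bool      # True only if the edit passed the syntax check
    lines: list[str]   # file contents after the call (unchanged on failure)
    message: str       # feedback that would be shown to the agent


def guarded_edit(lines: list[str], start: int, end: int,
                 replacement: list[str]) -> EditResult:
    """Replace lines start..end (1-indexed, inclusive) only if the result parses.

    Hypothetical helper: ast.parse stands in for a real linter.
    """
    candidate = lines[: start - 1] + replacement + lines[end:]
    try:
        ast.parse("\n".join(candidate))
    except SyntaxError as err:
        # Reject the edit and report both versions so the agent can compare them.
        message = "\n".join([
            f"Your proposed edit introduced a syntax error: {err.msg} (line {err.lineno})",
            "This is how your edit would have looked if applied:",
            *candidate,
            "This is the original code before your edit:",
            *lines,
            "DO NOT re-run the same failed edit command. "
            "Running it again will lead to the same error.",
        ])
        return EditResult(False, lines, message)
    return EditResult(True, candidate, "Edit applied.")
```

A rejected edit leaves the file untouched, so the agent's next action starts from a state that still parses instead of from a broken one.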
[78]

Prompt templates: These prompt templates are used to inform the language model of the task setting, show the list of available commands, augment environment responses with the values of state variables, and provide the initial task setting

[79]

Command files: These files contain the source code of bash or Python functions and scripts. Commands are easily modified, added, and removed through manipulating these files’ code contents directly. Documentation added in these files can also be injected into prompts to inform the model of the available commands
[80]

Control flow: Methods for parsing model responses and processing history can be specified through these configuration arguments

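The idea that documentation in command files can be injected into prompts can be sketched as below. Everything here is an assumption made for illustration: the command functions, the `COMMANDS` registry, the `{command_docs}` placeholder, and the helper names are hypothetical and do not reproduce SWE-agent's actual configuration format.

```python
import inspect


def open_file(path: str, line: int = 1) -> str:
    """open <path> [<line>]: show the file in the viewer, starting at the given line."""
    return f"(viewer) {path}:{line}"


def search_dir(term: str, directory: str = ".") -> str:
    """search_dir <term> [<dir>]: list files under the directory that contain the term."""
    return f"(search) {term} in {directory}"


# Hypothetical registry: adding or removing a command is a one-line change here.
COMMANDS = {"open": open_file, "search_dir": search_dir}


def command_docs(commands: dict) -> str:
    """Collect each command's docstring into a block suitable for a prompt."""
    lines = ["COMMANDS:"]
    for name in sorted(commands):
        doc = inspect.getdoc(commands[name]) or f"{name}: (undocumented)"
        lines.append(f"  {doc}")
    return "\n".join(lines)


def build_system_prompt(template: str, commands: dict) -> str:
    """Fill the template's {command_docs} placeholder with the generated block."""
    return template.format(command_docs=command_docs(commands))
```

Because the prompt text is generated from the same definitions that implement the commands, the documentation shown to the model cannot drift from the set of commands that is actually available.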
