Recognition: 3 theorem links · Lean Theorem
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Pith reviewed 2026-05-11 07:00 UTC · model grok-4.3
The pith
OpenHands is an open platform that lets AI agents develop software by writing code, using the command line, and browsing the web.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce OpenHands, a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. The platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering and web browsing.
What carries the argument
The OpenHands platform, which supplies standardized interfaces for agents to write and execute code, run terminal commands, and perform web actions inside isolated sandboxes.
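To make that interface concrete, here is a minimal, self-contained sketch of the agent / event-stream / runtime loop the platform is described as providing. Every name below (Action, Observation, EventStream, SandboxRuntime, EchoAgent) is a simplified stand-in invented for illustration, not OpenHands' actual API.

```python
# Illustrative sketch of an agent/event-stream/runtime loop in the style the
# paper describes. All names here are simplified stand-ins, NOT OpenHands' API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    kind: str           # e.g. "run_cmd", "write_file", "browse", "finish"
    payload: str = ""

@dataclass
class Observation:
    source_action: Action
    content: str

@dataclass
class EventStream:
    """Shared history of actions and observations."""
    events: List[object] = field(default_factory=list)

    def add(self, event: object) -> None:
        self.events.append(event)

class SandboxRuntime:
    """Turns actions into observations; a real runtime would execute them in isolation."""
    def execute(self, action: Action) -> Observation:
        return Observation(action, f"simulated result of {action.kind}: {action.payload!r}")

class EchoAgent:
    """Toy agent: runs one command, then finishes."""
    def step(self, stream: EventStream) -> Action:
        if any(isinstance(e, Observation) for e in stream.events):
            return Action("finish")
        return Action("run_cmd", "pytest -q")

def run_episode(agent: EchoAgent, runtime: SandboxRuntime, max_steps: int = 10) -> EventStream:
    stream = EventStream()
    for _ in range(max_steps):
        action = agent.step(stream)
        stream.add(action)
        if action.kind == "finish":
            break
        stream.add(runtime.execute(action))
    return stream

if __name__ == "__main__":
    for event in run_episode(EchoAgent(), SandboxRuntime()).events:
        print(event)
```

The only point is the shape of the loop: the agent reads the shared history, emits an action, and the runtime turns that action into an observation appended to the same stream.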
Load-bearing premise
That the sandboxing, multi-agent coordination, and benchmark tools will produce agents whose results on test tasks carry over to useful, real-world software work.
What would settle it
An experiment in which an agent built with OpenHands attempts a SWE-Bench task and either escapes its sandbox or produces code that cannot pass the required tests without human changes.
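A minimal sketch of the test-pass half of that check, assuming the agent's output has been captured as a git-style patch; the repository path, patch file name, and test command below are placeholders, and verifying sandbox containment would need separate instrumentation.

```python
# Hedged sketch of the check described above: does the agent's patch apply
# cleanly and pass the task's tests with no human edits? The paths, patch
# file, and test command are placeholders, not a real SWE-Bench harness.
import subprocess

def passes_without_human_changes(repo_dir: str, patch_file: str, test_cmd: list) -> bool:
    """True only if the agent-generated patch applies and the tests pass as-is."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0

if __name__ == "__main__":
    ok = passes_without_human_changes("task_repo", "agent_patch.diff", ["pytest", "-q"])
    print("passes unmodified" if ok else "fails or needs human edits")
```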
Original abstract
Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenHands (formerly OpenDevin), an open platform for developing AI agents that perform software engineering tasks by writing code, interacting with command-line interfaces, and browsing the web. It describes the platform's support for implementing new agents, safe sandboxed code execution, multi-agent coordination, and integration with evaluation benchmarks. The authors report performing evaluations of agents across 15 tasks drawn from benchmarks including SWE-Bench and WebArena, and note the project's MIT license and community contributions exceeding 2.1K from over 188 contributors.
Significance. If the described architecture and features hold, this is a useful engineering contribution to AI agent research for software development, providing an extensible, open-source framework that integrates existing benchmarks and supports community-driven development. The permissive licensing and documented contributor scale are explicit strengths that aid reproducibility and extension by others in the field.
Major comments (1)
- [Evaluation] Evaluation section: the manuscript states that evaluations were performed across 15 tasks from SWE-Bench, WebArena, and related benchmarks, but reports no quantitative metrics (e.g., success rates, pass@k scores), baselines, error analysis, or per-task breakdowns. This absence leaves the central claim that the platform enables 'powerful and flexible' agents only moderately supported, as effectiveness cannot be assessed from the provided description alone.
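For reference only (standard background, not a result from the manuscript): the pass@k score mentioned in the comment is conventionally estimated, with c correct samples out of n per task, as pass@k = 1 - C(n-c, k)/C(n, k). A minimal computation sketch with made-up numbers follows.

```python
# Conventional unbiased pass@k estimator, included only to illustrate the kind
# of metric the comment asks for; the example numbers are made up.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples per task, c = correct samples among them, k = evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3 for this made-up task
```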
Minor comments (2)
- [Abstract] The abstract and introduction use 'f.k.a. OpenDevin' without a dedicated footnote or section explaining the rebranding rationale or continuity of prior work.
- [Architecture] Architecture diagrams (if present in §3) would benefit from explicit callouts for the sandboxing and multi-agent coordination mechanisms to improve clarity for readers implementing new agents.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the platform's utility for the community. We address the major comment point by point below.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: the manuscript states that evaluations were performed across 15 tasks from SWE-Bench, WebArena, and related benchmarks, but reports no quantitative metrics (e.g., success rates, pass@k scores), baselines, error analysis, or per-task breakdowns. This absence leaves the central claim that the platform enables 'powerful and flexible' agents only moderately supported, as effectiveness cannot be assessed from the provided description alone.
Authors: We agree that the current evaluation section primarily describes the platform's integration with the benchmarks and the selection of 15 tasks without providing quantitative metrics, baselines, or per-task breakdowns. The manuscript's core contribution is the open platform itself (architecture, sandboxing, multi-agent support, and benchmark integration), with the evaluation intended to demonstrate applicability rather than to serve as a full agent benchmark study. To better substantiate the claims of enabling powerful and flexible agents, we will revise the evaluation section to include available quantitative results (success rates on the tasks), relevant baselines, and per-task breakdowns. This will be incorporated in the next version of the manuscript.
Revision: yes
Circularity Check
No significant circularity
Full rationale
The paper is a descriptive engineering platform paper that introduces OpenHands (f.k.a. OpenDevin) for developing AI agents capable of code writing, CLI interaction, and web browsing. It details sandboxed environments, multi-agent coordination, benchmark integration, and reports evaluations on 15 tasks from existing benchmarks (SWE-Bench, WebArena, etc.). No mathematical derivations, equations, fitted parameters, or predictions appear that reduce to prior quantities by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing for any central claim. The contribution is self-contained as an open-source platform release under MIT license with community input.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced · unclear
The relation between the cited Recognition theorem and the paper passage is unclear.
Paper passage: We introduce OpenHands... platform for... AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web... safe interaction with sandboxed environments... coordination between multiple agents... evaluation benchmarks.
- IndisputableMonolith.Foundation.LedgerCanonicalityZeroParameterComparisonLedger · unclear
The relation between the cited Recognition theorem and the paper passage is unclear.
Paper passage: OpenHands consists of 3 main components: 1) Agent abstraction... 2) Event stream... 3) Runtime to execute all actions into observations.
- IndisputableMonolith.Foundation.PhiForcing.phi_forcing · unclear
The relation between the cited Recognition theorem and the paper passage is unclear.
Paper passage: Evaluation... 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
-
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
Crab bridges the agent-OS semantic gap with an eBPF inspector, turn-aligned coordinator, and host engine to deliver 100% recovery correctness while cutting checkpoint traffic up to 87% and adding under 2% overhead.
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
-
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
-
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
-
Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
LLMVD.js uses LLM agents to confirm 84% of taint-style vulnerabilities on public benchmarks (vs. <22% for prior tools) and generates validated exploits for 36 of 260 new packages (vs. ≤2 for traditional tools).
-
Neurosymbolic Repo-level Code Localization
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its ...
-
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
-
Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation
R2ABench benchmark shows LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.
-
Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents
A LoRA-fine-tuned Qwen 3.5 2B model for task-conditioned tool-output pruning reaches 0.86 recall and 0.80 F1 on a new 618-example test set while removing 92% of input tokens and outperforming larger zero-shot models.
-
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
-
ABTest: Behavior-Driven Testing for AI Coding Agents
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
-
Automating Database-Native Function Code Synthesis with LLMs
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreS...
-
Dynamic analysis enhances issue resolution
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
TheAgentCompany benchmark finds that the strongest LLM agents autonomously complete 30% of tasks in a simulated real-world software company environment.
-
How to Interpret Agent Behavior
ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the Sta...
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
Architectural Design Decisions in AI Agent Harnesses
An empirical study of 70 AI agent systems identifies five design dimensions and five common architectural patterns in their supporting infrastructure.
-
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.
-
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
-
AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow
AutoSurrogate is a multi-agent LLM framework that autonomously constructs, tunes, and validates deep learning surrogates for subsurface flow from natural language, outperforming expert baselines on a 3D carbon storage task.
-
From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python
LLM-driven translation of a production Rust AI agent to Python achieves near-parity on SWE-bench (73.8% vs 70.0%) and Terminal-Bench (42.5% vs 47.5%) while evolving into a 15.9x smaller superset with 30 new capabilities.
-
Contexty: Capturing and Organizing In-situ Thoughts for Context-Aware AI Support
Contexty captures users' cognitive traces as editable snippets and organizes them to enable more effective, user-controlled context-aware AI collaboration during complex tasks.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
-
Program Analysis Guided LLM Agent for Proof-of-Concept Generation
PAGENT integrates static and dynamic program analysis guidance with an LLM agent to improve automated proof-of-concept generation success by 132% over prior agentic methods.
-
Auditable Agents
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...
-
On the Role of Fault Localization Context for LLM-Based Program Repair
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
-
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
-
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
-
A-MEM: Agentic Memory for LLM Agents
A-MEM is a dynamic memory system for LLM agents that builds and refines an interconnected network of notes with agent-driven linking and evolution, showing performance gains over prior memory methods on six models.