Heimdall automates translation of eBPF C programs to Rust with formal equivalence proofs for 94.1% of 102 tested programs using LLMs, static analysis, and Z3-based checking.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation tasks. Our evaluations show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues. The best-performing model, Claude 2, is able to solve a mere 1.96% of the issues. Advances on SWE-bench represent steps towards LMs that are more practical, intelligent, and autonomous.
representative citing papers
ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.
MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
LLMs brew code answers in early layers before resolving into Resolved, Overprocessed, Misresolved, or Unresolved states, with 41.5% resolved overall and brewing duration stable at 24-42% across 16 models.
Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.
A new six-dimension process taxonomy for AI software development frameworks shows convergence on artifact persistence and human oversight but reveals that no framework covers all dimensions strongly, indicating a depth-portability trade-off.
Sakura is a multi-agent system that generates structurally complex tests from NL descriptions, achieving 50-78% higher compilability and 38-66% higher coverage overlap than baselines on 1,464 scenarios from 20 Apache Commons applications.
AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
VirtualME is a new infrastructure that continuously extracts and interprets in-IDE developer behaviors to build personalized personas, delivering 33.8% better performance on repository-level knowledge Q&A than generic baselines.
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
citing papers explorer
- ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
  ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.
- StabilizerBench: A Benchmark for AI-Assisted Quantum Error Correction Circuit Synthesis
  StabilizerBench is a new benchmark for evaluating AI agents on generating, optimizing, and making fault-tolerant stabilizer circuits for quantum error correction, with efficient verification and multi-tier scoring.
- From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents
  A new six-dimension process taxonomy for AI software development frameworks shows convergence on artifact persistence and human oversight but reveals that no framework covers all dimensions strongly, indicating a depth-portability trade-off.
- IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
  IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
- Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
  Reversa is a reverse documentation engineering framework that deploys a multi-agent pipeline to extract implicit rules from legacy software and produce traceable specifications with confidence scores and explicit gaps for human review.
- Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
  BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
- 3D Primitives are a Spatial Language for VLMs
  3D geometric primitives in executable code act as an effective intermediate spatial language that boosts VLMs on reconstruction and question-answering tasks.
- Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
  SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
- Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
  Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
- MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
  MatClaw shows a code-first LLM agent autonomously generating and executing workflows for ML force field training, Curie temperature prediction, and parameter search on CuInP2S6, succeeding on code but requiring interventions for tacit domain knowledge.
- AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
  AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
- Dynamic analysis enhances issue resolution
  DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
- Automatic Generation of High-Performance RL Environments
  Closed-loop prompt-based translation with hierarchical verification and iterative repair produces equivalent high-performance RL environments across five cases including new TCGJax.
- Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?
  Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
- Open-World Evaluations for Measuring Frontier AI Capabilities
  Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.
- MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair
  MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refinement memories with iterative feedback.
- Revisiting DAgger in the Era of LLM-Agents
  DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
- Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization
  Deterministic orchestration matches LLM-controlled methods in COBOL-to-Python translation accuracy but improves worst-case robustness, reduces run-to-run variability, and cuts token consumption by up to 3.5 times.
- Why Does Agentic Safety Fail to Generalize Across Tasks?
  Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
  MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
- Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation
  ACToR improves C-to-Rust translation correctness via an adversarial LLM-agent loop that generates differential fuzz tests to drive iterative refinements, achieving over 90% pass rates on 63 real-world utilities.
- Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
  Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
  DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
- EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair
  ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
- The AI Codebase Maturity Model: From Assisted Coding to Fully Autonomous Systems
  The AI Codebase Maturity Model defines six sequential levels of AI-driven development based on feedback loop topologies, validated by experience reports showing 5x PR and 37x issue throughput gains from level 2 to level 6.
- Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
  The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- "Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering
  User study reveals nine LLM failure categories in SE tasks and quantifies abandonment factors from 26 participants.