Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents.CoRR, abs/2512.12730

Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, et al · 2025 · arXiv 2512.12730

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 baseline 1

representative citing papers

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 percent.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.

Toward Executable Repository-Level Code Generation via Environment Alignment

cs.SE · 2026-04-04 · unverdicted · novelty 7.0

EnvGraph improves executable repository-level code generation by jointly modeling external dependencies and internal references through a dual-layer environment representation and targeted iterative alignment.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

cs.SE · 2026-07-01 · unverdicted · novelty 5.0

A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.

ATM: CID-Brokered Pre-Write Admission for Multi-Agent Code Co-Synthesis

cs.SE · 2026-06-29 · unverdicted · novelty 5.0

ATM is a CID-brokered governance framework that maps write intents to semantic atoms for pre-admission control, validation, and neutral-steward application in single-domain multi-agent code synthesis.

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

cs.AI · 2026-06-30 · unverdicted · novelty 2.0

Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.

citing papers explorer

Showing 1 of 1 citing paper after filters.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle cs.SE · 2026-05-13 · unverdicted · none · ref 10
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents.CoRR, abs/2512.12730

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer