MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.
ProgramBench: Can Language Models Rebuild Programs From Scratch?
8 Pith papers cite this work. Polarity classification is still indexing.
abstract
Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holisitically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95\% of tests on only 3\% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
years
2026 8representative citing papers
SEC-bench Pro benchmark with 183 real vulnerabilities shows frontier LLM coding agents achieve at most 38.8% success on SpiderMonkey and 32% on V8.
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
ORAgentBench evaluates 14 LLM agent configurations on 107 end-to-end OR tasks and finds the best agent passes only 35.51% overall and 20.59% of hard tasks.
LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.
RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20% with none finishing the full pipeline and costs varying by three orders of magnitude.
A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.
Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.
citing papers explorer
-
MirrorCode: AI can rebuild entire programs from behavior alone
MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.
-
SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
SEC-bench Pro benchmark with 183 real vulnerabilities shows frontier LLM coding agents achieve at most 38.8% success on SpiderMonkey and 32% on V8.
-
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
-
ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
ORAgentBench evaluates 14 LLM agent configurations on 107 end-to-end OR tasks and finds the best agent passes only 35.51% overall and 20.59% of hard tasks.
-
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.
-
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems
RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20% with none finishing the full pipeline and costs varying by three orders of magnitude.
-
Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering
A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.
-
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.