ProgramBench: Can Language Models Rebuild Programs From Scratch?

2026 · cs.SE · arXiv 2605.03546

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
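
The abstract describes grading a reimplementation purely by whether its observable behavior matches the reference executable. The sketch below illustrates that general idea only; it is not ProgramBench's harness. The names `Observation`, `observe`, and `pass_rate` are hypothetical, the comparison is limited to exit code and standard output (an assumption), and the test inputs are hand-written rather than produced by agent-driven fuzzing.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What one run of a program exposes to a behavioral test (assumed scope)."""
    exit_code: int
    stdout: bytes


def observe(command, stdin_data, timeout=10.0):
    """Run a command with the given stdin bytes and record its observable behavior."""
    result = subprocess.run(
        command,
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return Observation(result.returncode, result.stdout)


def pass_rate(reference_cmd, candidate_cmd, test_inputs):
    """Fraction of inputs on which the candidate's behavior equals the reference's."""
    if not test_inputs:
        raise ValueError("need at least one test input")
    passed = sum(
        observe(reference_cmd, data) == observe(candidate_cmd, data)
        for data in test_inputs
    )
    return passed / len(test_inputs)


# Hypothetical stand-ins for a reference executable and a reimplementation:
# two differently structured programs that both upper-case their stdin.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE = [sys.executable, "-c",
             "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())"]

if __name__ == "__main__":
    inputs = [b"hello\n", b"mixed Case\nsecond line\n", b""]
    print(pass_rate(REFERENCE, CANDIDATE, inputs))
```

Because the check looks only at inputs and outputs, the two stand-in programs count as equivalent even though one reads all of stdin at once and the other streams line by line, which is the sense in which such tests do not prescribe implementation structure.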

representative citing papers

MirrorCode: AI can rebuild entire programs from behavior alone

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

cs.CR · 2026-05-26 · unverdicted · novelty 7.0

SEC-bench Pro benchmark with 183 real vulnerabilities shows frontier LLM coding agents achieve at most 38.8% success on SpiderMonkey and 32% on V8.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

ORAgentBench evaluates 14 LLM agent configurations on 107 end-to-end OR tasks and finds the best agent passes only 35.51% overall and 20.59% of hard tasks.

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

cs.AI · 2026-05-31 · conditional · novelty 6.0

LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20% with none finishing the full pipeline and costs varying by three orders of magnitude.

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

cs.SE · 2026-07-01 · unverdicted · novelty 5.0

A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

cs.SE · 2026-06-16 · unverdicted · novelty 4.0

Coding benchmarks misalign with agentic software engineering because they conflate model and harness, grade against single references, and provide no component-level iteration signals.
