Abc-bench: Benchmarking agentic backend coding in real-world development. arXiv preprint arXiv:2601.11077

2026 · arXiv 2601.11077

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

cs.CR · 2026-05-05 · unverdicted · novelty 7.0

MOSAIC-Bench demonstrates that nine production coding agents achieve 53-86% end-to-end attack success rates on staged innocuous tickets across 10 web substrates and 31 CWE classes, far higher than the 0-20.4% rates seen with direct prompts.

From Question Answering to Task Completion: A Survey on Agent System and Harness Design

cs.AI · 2026-06-14 · unverdicted · novelty 4.0

Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.

citing papers explorer

Showing 3 of 3 citing papers.

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
cs.SE · 2026-05-07 · unverdicted · none · ref 23

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
cs.CR · 2026-05-05 · unverdicted · none · ref 14

From Question Answering to Task Completion: A Survey on Agent System and Harness Design
cs.AI · 2026-06-14 · unverdicted · none · ref 181
