ProcCtrlBench introduces an ontology of 11 defect types across 4 categories plus control preservation metrics to evaluate LLM coding agent trajectories on 200 cases from AndroidBench, TerminalBench, and SWE-bench-Verified.
ACM Computing Surveys , volume=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Item response theory applied to 17 LLMs on SciEntsBank and Beetle reveals that models with similar overall scores differ sharply in robustness to difficult responses, with errors clustering on partial-credit labels.
citing papers explorer
-
ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
ProcCtrlBench introduces an ontology of 11 defect types across 4 categories plus control preservation metrics to evaluate LLM coding agent trajectories on 200 cases from AndroidBench, TerminalBench, and SWE-bench-Verified.