pith. machine review for the scientific record.

arxiv: 2604.03362 · v2 · submitted 2026-04-03 · 💻 cs.SE

Recognition: 1 theorem link · Lean Theorem

ABTest: Behavior-Driven Testing for AI Coding Agents

Gias Uddin, Hung Viet Pham, Jinqiu Yang, Moses Openja, Song Wang, Wuyang Dai

Pith reviewed 2026-05-13 18:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI coding agents · behavior-driven testing · fuzzing framework · software robustness · anomaly detection · interaction patterns · user-reported failures · test generation

The pith

ABTest converts 400 real user-reported failures into 647 executable tests that flag 1,573 anomalies across three AI coding agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ABTest as a behavior-driven fuzzing framework that mines developer-confirmed anomalies to derive reusable Interaction Patterns and Action types. These patterns are composed into stepwise templates, instantiated as concrete test cases inside actual code repositories, and then executed against coding agents while capturing traces. Running the resulting 647-case suite on Claude Code, OpenAI Codex CLI, and Gemini CLI produces 1,573 flagged behavioral anomalies, of which 642 are manually verified as previously unreported true failures. A sympathetic reader would care because AI coding agents are moving into live development workflows, yet their failure modes under realistic conditions remain largely untested. The method supplies a repeatable way to turn anecdotal bug reports into systematic, repository-grounded evaluations.

Core claim

ABTest (1) mines user-reported anomalies to derive 47 Interaction Patterns and 128 Action types, (2) composes them into stepwise fuzzing templates, (3) instantiates executable test cases in real repositories, (4) executes them against coding agents while recording traces, and (5) detects and validates anomalous behaviors. Applied to 400 developer-confirmed failures, the framework generates 647 repository-grounded cases whose execution flags 1,573 anomalies, 642 of which are manually confirmed as new true anomalies at 40.8 percent precision.
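
As a concrete way to read the five-stage claim, here is a minimal Python skeleton of the pipeline as the paper describes it. Every function name, signature, and data structure below is an editorial illustration with elided bodies, not the authors' published interface.

```python
# Editorial sketch of the five ABTest stages; stage bodies are elided and all
# names are hypothetical -- the paper does not release this interface.
from dataclasses import dataclass


@dataclass
class SeedTemplate:
    interaction_pattern: str   # a mined workflow pattern
    action_type: str           # a mined agent behavior
    steps: list[str]           # abstract stepwise instructions


def mine_patterns(reports: list[dict]) -> tuple[list[str], list[str]]:
    """Stage 1: derive Interaction Patterns and Action types from anomaly reports."""
    ...


def compose_templates(patterns: list[str], actions: list[str]) -> list[SeedTemplate]:
    """Stage 2: pair compatible patterns and actions into stepwise fuzzing templates."""
    ...


def instantiate(template: SeedTemplate, repo_path: str) -> dict:
    """Stage 3: ground a template in a real repository, yielding an executable case."""
    ...


def execute(case: dict, agent: str) -> dict:
    """Stage 4: run the case with a coding agent, recording traces and artifacts."""
    ...


def detect_anomalies(trace: dict) -> list[dict]:
    """Stage 5: flag behavioral anomalies in a trace for later manual validation."""
    ...


def run_abtest(reports: list[dict], repos: list[str], agents: list[str]) -> list[dict]:
    """Wire the stages together in the order the core claim lists them (skeleton only)."""
    patterns, actions = mine_patterns(reports)
    cases = [instantiate(t, r) for t in compose_templates(patterns, actions) for r in repos]
    traces = [execute(c, a) for c in cases for a in agents]
    return [anomaly for t in traces for anomaly in detect_anomalies(t)]
```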

What carries the argument

Interaction Patterns and Action types mined from user-reported anomalies, composed into stepwise fuzzing templates that are instantiated as executable test cases inside real repositories.
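
To make the repository-grounded instantiation tangible, below is a hypothetical compact JSON-style test case in the spirit of the artifacts Figures 5 through 7 describe. Only the expected artifact output/coverage.xml and the expected change to logs/tool.log are taken from Figure 6's caption; every other field name and value is invented for illustration and is not the paper's schema.

```python
import json

# Hypothetical repository-grounded test case, loosely modeled on the compact
# JSON artifact style the figures describe; the schema here is illustrative.
test_case = {
    "id": "Test-XXXX",  # placeholder id, not a real case number from the paper
    "interaction_pattern": "edit-then-rollback-then-verify",
    "action_type": "post-rollback verification",
    "steps": [
        "S01: ask the agent to modify a target module in the cloned repository",
        "S02: request a test run and capture the generated report",
        "S03: instruct the agent to roll back its previous change",
        "S04: verify that the workspace matches the pre-edit state",
    ],
    "expected_new_artifacts": ["output/coverage.xml"],   # named in Figure 6's caption
    "expected_file_changes": ["logs/tool.log"],          # named in Figure 6's caption
}

print(json.dumps(test_case, indent=2))
```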

If this is right

  • ABTest exposes measurable robustness differences among distinct coding-agent families when the same test bundle is executed.
  • The framework surfaces failure modes that were not previously documented in the literature or vendor reports.
  • The 40.8 percent precision rate indicates that roughly two-fifths of the flagged anomalies are genuine new issues warranting developer attention (a quick arithmetic check follows this list).
  • Repository-grounded instantiation ensures the generated tests reflect actual code contexts rather than synthetic toy problems.
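
The precision figure is just the ratio of the two reported counts; a one-line check:

```python
# Sanity-check the reported detection precision from the two headline counts.
confirmed_new = 642        # manually confirmed new true anomalies
flagged_total = 1573       # behavioral anomalies flagged across the three agents
print(f"{confirmed_new / flagged_total:.1%}")   # 40.8%, matching the reported figure
```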

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could embed the pattern-mining step inside issue trackers so that every new confirmed failure automatically expands the test suite.
  • The same mining-to-fuzzing pipeline could be applied to non-coding AI agents such as planning or debugging assistants.
  • Periodic re-execution of the 647-case bundle after model updates would give a quantitative regression signal for agent robustness; a minimal sketch of such a signal follows this list.
  • The Action-type taxonomy might serve as a lightweight specification language for future agent safety benchmarks.
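
As one way to cash out the re-execution idea, here is a minimal sketch, assuming each run's flagged anomalies carry a severity label matching the three categories named in Figures 8 and 9 (critical, expected outcome, minor). The data layout and counting scheme are editorial assumptions, not artifacts from the paper.

```python
from collections import Counter

def anomaly_counts(run: list[dict]) -> Counter:
    """Count flagged anomalies per severity label, e.g. 'critical', 'expected-outcome', 'minor'."""
    return Counter(anomaly["severity"] for anomaly in run)

def regression_delta(before: list[dict], after: list[dict]) -> dict[str, int]:
    """Per-severity change between runs; positive values mean more anomalies after the update."""
    b, a = anomaly_counts(before), anomaly_counts(after)
    return {severity: a[severity] - b[severity] for severity in set(a) | set(b)}

# Toy usage with invented runs of the same test bundle before and after a model update.
old_run = [{"severity": "minor"}, {"severity": "critical"}]
new_run = [{"severity": "minor"}, {"severity": "minor"}]
print(regression_delta(old_run, new_run))   # e.g. {'minor': 1, 'critical': -1} (key order may vary)
```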

Load-bearing premise

The 400 user-reported anomalies are representative of the full space of agent failures and the derived patterns capture essential behaviors without significant selection bias.

What would settle it

Applying the same mining and generation process to an independent, larger corpus of confirmed agent failures and obtaining a materially lower rate of new true anomalies would falsify the claim that the extracted patterns generalize.

Figures

Figures reproduced from arXiv: 2604.03362 by Gias Uddin, Hung Viet Pham, Jinqiu Yang, Moses Openja, Song Wang, Wuyang Dai.

Figure 1. The overview of ABTest.

Figure 4. Transcript excerpt from Gemini CLI issue #4586, preserving the original loss-claim wording from the run trace.

Figure 3. Transcript excerpt from Gemini CLI issue #4586, preserving the original sequence of claims and checks from the run trace.

Figure 5. Seed template example formed from a compatible Interaction Pattern–Action Type pair, shown as the original compact JSON artifact used by the pipeline.

Figure 6. Repository-grounded instantiated test case example for Test-0001, shown as the compact JSON artifact used in execution.

Figure 7. Compact JSON trace artifact for case Test-0001, step S05.

Figure 8. Overlap decomposition by anomaly type for Claude Code with different LLMs, i.e., Claude 4.5 Haiku vs. Claude 3.5 Haiku. Panels: (a) critical anomaly, (b) expected outcome anomaly, (c) minor anomaly.

Figure 9. Overlap decomposition by anomaly type for Codex with different LLMs, i.e., GPT-5.1-Codex-Mini vs. GPT-4o-mini.
read the original abstract

AI coding agents are increasingly integrated into real-world software development workflows, yet their robustness under diverse and adversarial scenarios remains poorly understood. We present ABTest, a behavior-driven fuzzing framework that systematically tests coding agents by turning real-world failure reports into repository-grounded behavioral tests. ABTest (1) mines user-reported anomalies to derive reusable workflow patterns (Interaction Patterns) and behaviors (Action types); (2) composes them into stepwise fuzzing templates; (3) instantiates executable test cases in real repositories; (4) executes them with coding agents while recording traces and artifacts; and (5) detects and validates anomalous behaviors. We apply ABTest to three widely used coding agents: Claude Code, OpenAI Codex CLI, and Gemini CLI. From 400 user-reported developer-confirmed agent failures, we extract 47 Interaction Patterns and 128 Action types, generating 647 repository-grounded fuzzing cases. Executing the 647-case bundle once per evaluated configuration, ABTest flags 1,573 behavioral anomalies across the three coding agent families, of which 642 are manually confirmed as new true anomalies, achieving a detection precision of 40.8%. Our results demonstrate that ABTest effectively uncovers real-world failures, exposes robustness differences across models, and reveals previously unreported failure modes.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents ABTest, a behavior-driven fuzzing framework that mines 400 user-reported developer-confirmed failures to derive 47 Interaction Patterns and 128 Action types, composes them into 647 repository-grounded test cases, executes the cases on Claude Code, OpenAI Codex CLI, and Gemini CLI, flags 1,573 behavioral anomalies, and manually confirms 642 as new true anomalies at 40.8% precision. It claims this approach uncovers real-world failures, exposes robustness differences across agents, and reveals previously unreported failure modes.

Significance. If the anomaly detection and manual validation steps can be made fully reproducible, the work supplies a concrete, repository-grounded method for stress-testing AI coding agents at scale. The reported counts (647 cases, 1,573 anomalies, 642 confirmed) and cross-agent comparison provide empirical evidence that could inform both agent development and future testing frameworks in software engineering.

major comments (3)
  1. [§4] §4 (Pattern Mining): The derivation of the 47 Interaction Patterns and 128 Action types from the 400 reports is described only at a high level; no coding protocol, inter-annotator agreement statistic, or explicit handling of selection bias is supplied. This directly affects the claim that the 647 generated cases are representative of the failure space.
  2. [§5.2] §5.2 (Anomaly Detection): The rules or heuristics used to flag the 1,573 anomalies from execution traces are not stated explicitly (e.g., no decision criteria, thresholds, or trace features). Without these, the 40.8% precision figure cannot be independently verified or reproduced.
  3. [§5.3] §5.3 (Validation): The manual confirmation step that yields the 642 'new true anomalies' provides no rubric for (a) distinguishing novelty from rediscovery of the original 400 reports or the 47 patterns, (b) operational definition of 'true anomaly' versus expected behavior, or (c) blinding or inter-rater reliability. This step is load-bearing for both the precision number and the 'previously unreported' claim.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'manually confirmed as new true anomalies' should include a forward reference to the validation subsection that defines the confirmation criteria.
  2. [Results] Table 2 (or equivalent results table): Per-agent and per-pattern anomaly counts are summarized at too high a level to allow readers to assess which Interaction Patterns drive the robustness differences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that improving the reproducibility of our methodology is essential and will revise the manuscript accordingly to address the concerns raised in sections 4, 5.2, and 5.3. Below we provide point-by-point responses.

read point-by-point responses
  1. Referee: [§4] §4 (Pattern Mining): The derivation of the 47 Interaction Patterns and 128 Action types from the 400 reports is described only at a high level; no coding protocol, inter-annotator agreement statistic, or explicit handling of selection bias is supplied. This directly affects the claim that the 647 generated cases are representative of the failure space.

    Authors: We acknowledge that §4 provides a high-level overview of the pattern mining process. In the revised version, we will expand this section to include: (1) the full coding protocol and annotation guidelines used by the researchers; (2) inter-annotator agreement statistics (e.g., Cohen's kappa or percentage agreement) calculated on a subset of the reports; and (3) a discussion of potential selection biases in the 400 reports and how we mitigated them (e.g., by sampling from diverse sources). These additions will better support the representativeness of the 647 test cases. (A minimal agreement-statistic sketch follows these responses.) revision: yes

  2. Referee: [§5.2] §5.2 (Anomaly Detection): The rules or heuristics used to flag the 1,573 anomalies from execution traces are not stated explicitly (e.g., no decision criteria, thresholds, or trace features). Without these, the 40.8% precision figure cannot be independently verified or reproduced.

    Authors: We agree that explicit rules are necessary for reproducibility. In the revision, we will detail the anomaly detection heuristics in §5.2, including the specific decision criteria, thresholds applied to trace features (such as execution logs, output differences, and error patterns), and any automated filters used to identify the 1,573 anomalies. This will allow independent verification of the process leading to the 40.8% precision. revision: yes

  3. Referee: [§5.3] §5.3 (Validation): The manual confirmation step that yields the 642 'new true anomalies' provides no rubric for (a) distinguishing novelty from rediscovery of the original 400 reports or the 47 patterns, (b) operational definition of 'true anomaly' versus expected behavior, or (c) blinding or inter-rater reliability. This step is load-bearing for both the precision number and the 'previously unreported' claim.

    Authors: We recognize the importance of transparency in the validation process. We will revise §5.3 to include: (a) a rubric for assessing novelty, such as checking against the original 400 reports and patterns; (b) an operational definition of 'true anomaly' (e.g., behaviors that deviate from expected agent functionality in a way that could impact real-world use); and (c) details on the validation procedure, including whether blinding was employed and any measures of inter-rater reliability. If the original process did not include blinding, we will note this as a limitation and describe how we ensured consistency. This will strengthen the claims regarding the 642 confirmed anomalies. revision: yes
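
For readers unfamiliar with the agreement statistic the first response mentions, here is a minimal Cohen's kappa computation on invented labels. It illustrates the statistic only; the labels, items, and protocol have no connection to the authors' actual annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators assigning categorical labels to the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators assigning invented Interaction Pattern labels to six reports.
ann_a = ["edit-loop", "rollback", "edit-loop", "verify", "rollback", "edit-loop"]
ann_b = ["edit-loop", "rollback", "verify", "verify", "rollback", "edit-loop"]
print(round(cohens_kappa(ann_a, ann_b), 2))   # 0.75
```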

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's chain mines 400 external user reports into 47 Interaction Patterns and 128 Action types, synthesizes 647 repository-grounded test cases, executes them on three independent coding agents, flags 1,573 anomalies, and manually confirms 642 as new. No equations, fitted parameters, or self-citations reduce the precision figure, anomaly counts, or 'previously unreported' claim to the input reports by construction. The manual confirmation step, while lacking an explicit rubric in the provided text, operates as an independent validation layer rather than a definitional loop. The overall methodology remains self-contained against the external agent executions and report-derived inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that user-reported failures provide sufficient coverage to derive generalizable patterns; no numeric free parameters are stated, but two new structuring concepts are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption User-reported anomalies are representative of real-world agent failures and sufficient to derive reusable patterns
    Framework begins by mining 400 such reports to produce the 47 patterns and 128 action types used for all subsequent test generation.
invented entities (2)
  • Interaction Patterns no independent evidence
    purpose: Reusable workflow patterns extracted from failure reports
    New abstraction introduced to structure the fuzzing templates; no external validation cited.
  • Action types no independent evidence
    purpose: Categorized agent behaviors derived from reports
    New categorization used to compose test cases; no external validation cited.

pith-pipeline@v0.9.0 · 5542 in / 1313 out tokens · 41574 ms · 2026-05-13T18:32:22.247477+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

  1. [1]

    [n. d.]. American Fuzzy Lop (AFL) GitHub Repository. https://github.com/google/afl

  2. [2]

    2025. Claude Code. https://www.claude.com/product/claude-code. Accessed: 2025-12-12

  3. [3]

    2025. Codex CLI. https://chatgpt.com/features/codex. Accessed: 2025-12-12

  4. [4]

    2025. Gemini CLI. https://geminicli.com. Accessed: 2025-12-12

  5. [5]

    Mohammad Abdollahi, Ruixin Zhang, Nima Shiri Harzevili, Jiho Shin, Song Wang, and Hadi Hemmati. 2026. Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence. TOSEM 2026 (2026)

  6. [6]

    Shivani Acharya and Vidhi Pandya. 2012. Bridge between black box and white box–gray box testing technique. International Journal of Electronics and Computer Science Engineering 2, 1 (2012), 175–185

  7. [7]

    Chuyang Chen and Brendan Dolan-Gavitt. 2025. ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space. In 34th USENIX Security Symposium (USENIX Security 25). 6279–6298

  8. [8]

    Patrice Godefroid, Michael Y. Levin, and David Molnar. 2012. SAGE: Whitebox Fuzzing for Security Testing. Commun. ACM 55, 3 (March 2012), 40–44. doi: 10.1145/2093548.2093564

  9. [9]

    Patrice Godefroid, Hila Peleg, and Rishabh Singh. 2017. Learn&Fuzz: Machine Learning for Input Fuzzing. In ASE 2017. IEEE, 50–59

  10. [10]

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In International Conference on Learning Representations

  11. [11]

    Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future. arXiv preprint arXiv:2408.02479 (2024)

  12. [12]

    Mohd Ehmer Khan and Farmeena Khan. 2012. A comparative study of white box, black box and grey box testing techniques. International Journal of Advanced Computer Science and Applications 3, 6 (2012)

  13. [13]

    Tianyang Liu, Canwen Xu, and Julian McAuley. 2023. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv preprint arXiv:2306.03091 (2023)

  14. [14]

    Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, et al. 2026. ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development. arXiv preprint arXiv:2602.01655 (2026)

  15. [15]

    Barton P. Miller, Gregory Cooksey, and Fredrick Moore. 2006. An empirical study of the robustness of macOS applications using random testing. In Proceedings of the 1st International Workshop on Random Testing. 46–54

  16. [16]

    Barton P. Miller, Lars Fredriksen, and Bryan So. 1990. An Empirical Study of the Reliability of UNIX Utilities. Commun. ACM 33, 12 (1990), 32–44. doi: 10.1145/96267.96279

  17. [17]

    Yaroslav Oliinyk, Michael Scott, Ryan Tsang, Chongzhou Fang, Houman Homayoun, et al. 2024. Fuzzing BusyBox: Leveraging LLM and Crash Reuse for Embedded Bug Unearthing. In 33rd USENIX Security Symposium (USENIX Security 24). 883–900

  18. [18]

    Hélio Victor F. Santos, Vitor Costa, João Eduardo Montandon, and Marco Tulio Valente. 2025. Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects. arXiv preprint arXiv:2511.09268 (2025)

  19. [19]

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. TSE 50, 4 (2024), 911–936

  20. [20]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2024. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv preprint arXiv:2407.16741 (2024)

  21. [21]

    Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. 2025. Agents in software engineering: Survey, landscape, and vision. Automated Software Engineering 32, 2 (2025), 70

  22. [22]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  23. [23]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv preprint arXiv:2405.15793 (2024)

  24. [24]

    Ao Zhang, Yiying Zhang, Yao Xu, Cong Wang, and Siwei Li. 2023. Machine learning-based fuzz testing techniques: A survey. IEEE Access 12 (2023), 14437–14454

  25. [25]

    Kunpeng Zhang, Zongjie Li, Daoyuan Wu, Shuai Wang, and Xin Xia. 2025. Low-Cost and Comprehensive Non-textual Input Fuzzing with LLM-Synthesized Input Generators. In 34th USENIX Security Symposium (USENIX Security 25). 6999–7018

  26. [26]

    Ruixin Zhang, Wuyang Dai, Hung Viet Pham, Gias Uddin, Jinqiu Yang, and Song Wang. 2026. Engineering Pitfalls in AI Coding Tools: An Empirical Study of Bugs in Claude Code, Codex, and Gemini CLI. In FSE 2026

  27. [27]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. arXiv preprint arXiv:2404.05427 (2024)