pith. machine review for the scientific record.

arxiv: 2604.18334 · v1 · submitted 2026-04-20 · 💻 cs.SE

Recognition: unknown

Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows

Abdul Ali Bangash (1), Maryam Abdul Ghafoor (1), Muizz Hussain (1), Sehrish Habib (1), Syed Muhammad Ashhar Shah (1) ((1) Lahore University of Management Sciences, Pakistan)

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI agents · GitHub Actions · CI/CD workflows · pull requests · workflow reliability · failure taxonomy · agentic contributions · Copilot

The pith

AI coding bots produce CI/CD workflow runs whose success rates drop as their pull requests become more frequent in a repository.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes tens of thousands of GitHub Actions runs triggered by pull requests from five AI bots to measure how reliably those runs complete. It reports substantial differences across bots, with two of them succeeding most often, and a clear negative link at the repository level between how often AI agents contribute and how often the associated workflows pass. The authors also sort thousands of failures into a taxonomy of categories and track how those categories change over time. These patterns matter to any team that already uses or plans to use AI agents inside automated build and test pipelines, because the data point to concrete places where extra safeguards could prevent downstream breakage.

Core claim

Using pull requests, commits, and repositories from the AIDev dataset, the authors retrieved 61,837 GitHub Actions workflow runs across 2,355 repositories that had been triggered by PRs authored by Claude, Devin, Cursor, Copilot, or Codex. Copilot and Codex showed the highest success rates, at roughly 93 percent and 94 percent. At the repository level they measured a negative correlation between the frequency of AI-agent contributions and workflow success rate, which they interpret as evidence that heavier use of these agents can reduce CI/CD reliability. They further classified 3,067 failed runs into 13 categories and observed shifts from functional to non-functional failure types over time.
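
The retrieval step the claim rests on is simple glue around GitHub's public Actions API. Below is a minimal sketch, assuming a GITHUB_TOKEN environment variable and (repository, head SHA, bot) triples already extracted from the AIDev dataset; it illustrates the linkage, not the authors' actual pipeline.

    # Minimal sketch, not the authors' pipeline. Assumes a personal access
    # token in GITHUB_TOKEN and (repo_full_name, head_sha, bot_name) triples
    # already extracted from the AIDev dataset.
    import os
    from collections import defaultdict

    import requests

    API = "https://api.github.com"
    HEADERS = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }

    def runs_for_pr_head(repo: str, head_sha: str) -> list[dict]:
        """Fetch the Actions runs triggered by the commit at a PR's head."""
        resp = requests.get(
            f"{API}/repos/{repo}/actions/runs",
            headers=HEADERS,
            params={"head_sha": head_sha, "per_page": 100},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["workflow_runs"]

    def per_bot_success_rates(prs: list[tuple[str, str, str]]) -> dict[str, float]:
        """prs: (repo_full_name, head_sha, bot_name) triples."""
        outcomes = defaultdict(list)  # bot_name -> success flags, one per run
        for repo, head_sha, bot in prs:
            for run in runs_for_pr_head(repo, head_sha):
                if run["status"] == "completed":  # skip queued/in-progress runs
                    outcomes[bot].append(run["conclusion"] == "success")
        return {bot: sum(flags) / len(flags) for bot, flags in outcomes.items() if flags}

Filtering to completed runs mirrors the natural reading of "success rate": queued or in-progress runs carry no conclusion to count.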

What carries the argument

The negative correlation between AI-agent PR frequency and repository-level CI/CD workflow success rate, measured by linking AIDev pull requests to GitHub Actions run outcomes via the public API.

Load-bearing premise

The AIDev dataset and its 2,355 repositories accurately represent typical AI-bot usage, and the five bots can be identified as PR authors without meaningful mislabeling or selection bias.

What would settle it

A larger or differently sampled collection of repositories that showed no negative correlation, or a positive correlation, between AI PR frequency and workflow success rate would falsify the central repository-level claim.
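
Operationally, that test is one correlation on a per-repository table. A minimal sketch follows, with field names of our own invention (ai_pr_freq, success_rate); the review does not say which estimator the paper uses, so Spearman here is an assumption.

    # Sketch of the repository-level check; the Spearman estimator and the
    # field names are assumptions, since the review does not specify either.
    from scipy.stats import spearmanr

    def repo_level_correlation(repos: list[dict]) -> tuple[float, float]:
        """repos: one record per repository with 'ai_pr_freq' (share of PRs
        authored by AI bots) and 'success_rate' (share of runs that passed)."""
        freq = [r["ai_pr_freq"] for r in repos]
        succ = [r["success_rate"] for r in repos]
        rho, p_value = spearmanr(freq, succ)
        return rho, p_value  # the paper's claim predicts rho < 0

On a replication sample, a rho at or above zero with adequate power is precisely the outcome the criterion above treats as falsifying.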

Figures

Figures reproduced from arXiv: 2604.18334 by Abdul Ali Bangash, Maryam Abdul Ghafoor, Muizz Hussain, Sehrish Habib, and Syed Muhammad Ashhar Shah (Lahore University of Management Sciences, Pakistan).

Figure 1: Overview of the CI/CD Reliability Analysis Workflow · view at source ↗
Figure 2: Scatter plot of Agentic PRs vs. workflow success · view at source ↗
Figure 3: Month-by-Month Trends in PR Categories Leading · view at source ↗
read the original abstract

Continuous Integration and Deployment (CI/CD) workflows are central to modern software delivery, yet the reliability of agentic AI bots operating within these workflows remains underexplored. Using pull requests (PRs), commits, and repositories from the AIDev dataset, we retrieved associated CI/CD workflow runs via the GitHub Actions API and analyzed 61,837 runs from 2,355 repositories, all triggered by PRs generated by five AI bots: Claude, Devin, Cursor, Copilot, and Codex. We observed substantial agent-dependent differences in workflow reliability, with Copilot and Codex achieving the highest success rates, ~93% and ~94% respectively. At the repository level, we find a negative correlation between AI agent contribution frequency and workflow success rate, suggesting that a higher frequency of Agentic PRs may hinder CI/CD workflow reliability. We defined a taxonomy of 13 categories against 3,067 agentic PRs whose associated workflows failed, and conducted a trend analysis that indicates visually observable shifts from functional to non-functional PR categories over time, although these trends are not statistically significant. Our findings motivate the need for actionable guidance on integrating AI agents into CI/CD workflows and prioritizing safeguards in workflows where failures are most likely to occur.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts an observational study of 61,837 GitHub Actions CI/CD workflow runs triggered by PRs from five AI bots (Claude, Devin, Cursor, Copilot, Codex) across 2,355 repositories drawn from the AIDev dataset. It reports bot-specific success rates (highest for Copilot and Codex at ~93-94%), a negative correlation at the repository level between AI-agent PR frequency and workflow success rate, and applies a 13-category failure taxonomy to 3,067 failed runs, noting a visually observable but non-statistically-significant shift in failure categories over time.

Significance. If the negative correlation is robust to controls for repository size/activity and to accurate bot attribution, the work would supply concrete empirical grounding for concerns about AI-bot integration in production CI/CD pipelines and would motivate targeted safeguards and usage guidelines in software engineering practice.

major comments (3)
  1. [Abstract (repository-level analysis) and associated results] The repository-level negative correlation between AI contribution frequency and workflow success rate is presented without reported p-values, confidence intervals, or controls for confounding variables such as repository size, activity level, or language ecosystem; this omission directly weakens the claim that higher frequency 'may hinder' reliability.
  2. [Data and methods (AIDev dataset usage)] PR authorship attribution to the five bots is load-bearing for both the frequency variable and the success-rate variable, yet the manuscript provides no validation metrics, false-positive/negative rates, or sensitivity analysis for the AIDev dataset labeling heuristics; systematic misattribution would distort the central correlation.
  3. [Failure taxonomy and trend analysis] The trend analysis of the 13-category failure taxonomy reports a 'visually observable' shift from functional to non-functional categories over time while explicitly noting that the trends are not statistically significant; this framing risks overstating an acknowledged non-result.
minor comments (2)
  1. [Methods] Clarify the exact operational definition of 'workflow success rate' (e.g., whether it is per-run or per-PR aggregated; the two aggregations are sketched after this list) and how 'AI agent contribution frequency' is normalized across repositories.
  2. [Data collection] Provide the total number of repositories screened versus the final 2,355 used, and any inclusion/exclusion criteria applied to the AIDev dataset.
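
The ambiguity in minor comment 1 is concrete enough to sketch. Assuming a runs table of our own invention (columns pr_id and conclusion), the two aggregations differ like this:

    # Two plausible readings of "workflow success rate"; the paper may use
    # either. `runs` is an assumed pandas DataFrame with columns pr_id and
    # conclusion ('success', 'failure', ...), one row per completed run.
    import pandas as pd

    def per_run_success_rate(runs: pd.DataFrame) -> float:
        """Each completed workflow run counts once."""
        return (runs["conclusion"] == "success").mean()

    def per_pr_success_rate(runs: pd.DataFrame) -> float:
        """Each PR counts once; a PR succeeds only if all of its runs do."""
        pr_ok = runs.groupby("pr_id")["conclusion"].agg(
            lambda c: (c == "success").all()
        )
        return pr_ok.mean()

Under the per-PR reading a single flaky job drags the whole PR to failure, so the two definitions can diverge sharply in repositories with many workflows.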

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the statistical reporting, clarify limitations, and avoid potential overstatement of results.

read point-by-point responses
  1. Referee: [Abstract (repository-level analysis) and associated results] The repository-level negative correlation between AI contribution frequency and workflow success rate is presented without reported p-values, confidence intervals, or controls for confounding variables such as repository size, activity level, or language ecosystem; this omission directly weakens the claim that higher frequency 'may hinder' reliability.

    Authors: We agree that formal statistical measures and controls would strengthen the repository-level analysis. In the revised manuscript we have added a multiple linear regression with AI PR frequency as the predictor and workflow success rate as the outcome. Controls include repository size (log stars and contributors), activity level (total PRs), and language ecosystem (a categorical variable for primary language); a sketch of this controlled regression follows these responses. The negative coefficient for AI frequency remains significant after controls; we now report the p-value and 95% confidence interval in the results section and have updated the abstract to reference the controlled analysis. revision: yes

  2. Referee: [Data and methods (AIDev dataset usage)] PR authorship attribution to the five bots is load-bearing for both the frequency variable and the success-rate variable, yet the manuscript provides no validation metrics, false-positive/negative rates, or sensitivity analysis for the AIDev dataset labeling heuristics; systematic misattribution would distort the central correlation.

    Authors: The AIDev dataset labels are produced by heuristics described in its source publication; we did not collect independent ground-truth validation for this study. We have added an explicit limitations paragraph on heuristic-based attribution and performed a sensitivity analysis that randomly flips 5% and 10% of bot labels before re-estimating the correlation (also sketched after these responses). The direction and significance of the negative association are robust under these perturbations. We have also cited the original labeling procedure more prominently in the methods. revision: partial

  3. Referee: [Failure taxonomy and trend analysis] The trend analysis of the 13-category failure taxonomy reports a 'visually observable' shift from functional to non-functional categories over time while explicitly noting that the trends are not statistically significant; this framing risks overstating an acknowledged non-result.

    Authors: We accept that the phrase 'visually observable' could be read as giving undue weight to a non-significant result. In the revision we have replaced it with: 'A visual inspection of the time-series plot indicates a possible shift from functional to non-functional failure categories, yet the trend does not reach statistical significance (p > 0.05) and remains exploratory.' We have also added a sentence underscoring that no causal or definitive conclusions should be drawn from this observation. revision: yes
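
The analyses described in responses 1 and 2 reduce to a short recipe. Below is a minimal sketch, assuming illustrative column names (ai_pr_freq, success_rate, stars, contributors, total_prs, language on a per-repository table; bot on a per-PR table); this is one plausible rendering of the described method, not the authors' code.

    # Sketch of the rebuttal's controlled regression (response 1) and
    # label-flip sensitivity check (response 2). All column names are
    # illustrative assumptions, not the authors' schema.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def controlled_regression(repos: pd.DataFrame):
        """Regress repo-level success rate on AI PR frequency, controlling for
        size (log stars, contributors), activity (total PRs), and language."""
        model = smf.ols(
            "success_rate ~ ai_pr_freq + np.log1p(stars) + contributors"
            " + total_prs + C(language)",
            data=repos,
        ).fit()
        return model  # inspect model.params['ai_pr_freq'] and model.conf_int()

    def flip_bot_labels(prs: pd.DataFrame, flip_rate: float, seed: int = 0) -> pd.DataFrame:
        """Randomly reassign a fraction of per-PR bot labels (e.g. 0.05, 0.10);
        ai_pr_freq is then recomputed and the regression re-estimated."""
        rng = np.random.default_rng(seed)
        perturbed = prs.copy()
        flip = rng.random(len(perturbed)) < flip_rate
        perturbed.loc[flip, "bot"] = rng.choice(
            perturbed["bot"].unique(), size=int(flip.sum())
        )
        return perturbed

Recomputing ai_pr_freq from the perturbed labels before re-fitting is the step that makes the check end-to-end rather than cosmetic.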

Circularity Check

0 steps flagged

No circularity: purely observational analysis of external dataset and API data

full rationale

The paper conducts an observational study by retrieving 61,837 GitHub Actions workflow runs from the AIDev dataset and GitHub API for PRs attributed to five AI bots. It computes success rates, repository-level correlations, and a taxonomy of failure categories directly from the measured data. No equations, fitted parameters, predictions, self-citations, or derivations are present that reduce any result to an input by construction. The negative correlation and agent-dependent differences are reported as empirical observations without any self-referential modeling steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the assumption that the AIDev dataset accurately labels AI-bot PRs and that GitHub workflow-run outcomes are correctly retrieved and attributed; no free parameters are fitted, no new physical entities are postulated, and the only invented construct is the 13-category taxonomy defined post hoc from the failure data.

axioms (2)
  • domain assumption: Pull-request authorship labels in the AIDev dataset correctly identify the five AI bots without substantial error.
    Required for all bot-specific success-rate comparisons.
  • domain assumption: GitHub Actions API returns complete and accurate workflow-run status for the sampled PRs.
    Underpins the 61,837-run success-rate calculations.
invented entities (1)
  • 13-category failure taxonomy: no independent evidence
    purpose: Classify the 3,067 failed agentic PRs into functional and non-functional groups.
    Defined by the authors against the observed failures; no independent validation or external reference taxonomy is mentioned.

pith-pipeline@v0.9.0 · 5556 in / 1500 out tokens · 68565 ms · 2026-05-10T04:22:50.117295+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1]

    Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools. Pearson, Boston, MA.

  2. [2]

    Henri Aïdasso, Francis Bordeleau, and Ali Tizghadam. 2025. On the Illusion of Success: An Empirical Study of Build Reruns and Silent Failures in Industrial CI. arXiv:2509.14347 [cs.SE] https://arxiv.org/abs/2509.14347

  3. [3]

    Mohammad Baqar, Saba Naqvi, and Rajat Khanda. 2025. AI-Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions. arXiv:2508.11867 [cs.SE] https://arxiv.org/abs/2508.11867

  4. [4]

    Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2022. Grounded Copilot: How Programmers Interact with Code-Generating Models. arXiv:2206.15000 [cs.HC] https://arxiv.org/abs/2206.15000

  5. [5]

    Moritz Beller, Georgios Gousios, and Andy Zaidman. 2017. Oops, My Tests Broke the Build: An Explorative Analysis of Travis CI with GitHub. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). 356–367. doi:10.1109/MSR.2017.62

  6. [6]

    Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3 (2006), 77–101. doi:10.1191/1478088706qp063oa

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    Moataz Chouchen, Narjes Bessghaier, Mahi Begoug, Ali Ouni, Eman Alomar, and Mohamed Wiem Mkaouer. 2024. How Do Software Developers Use ChatGPT? An Exploratory Study on GitHub Pull Requests. In Proceedings of the 21st International Conference on Mining Software Repositories (Lisbon, Portugal) (MSR ’24). Association for Computing Machinery, New York, NY, US...

  9. [9]

    Domenico Cotroneo, Cristina Improta, and Pietro Liguori. 2025. Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity. arXiv:2508.21634 [cs.SE] https://arxiv.org/abs/2508.21634

  10. [10]

    Matheus De Morais Leça, Mariana Bento, and Ronnie De Souza Santos. 2025. Responsible AI in the Software Industry: A Practitioner-Centered Perspective. In 2025 IEEE/ACM International Workshop on Responsible AI Engineering (RAIE). 37–40. doi:10.1109/RAIE66699.2025.00011

  11. [11]

    Ahmad Fadavi and Mohsen Alizadeh. 2024. Ownership of Artificial Intelligence-Generated Works: An Overview of the Emerging Intellectual Property Challenges in the Technology Era. (11 2024). doi:10.30497/law.2024.245828.3492

  12. [12]

    Keheliya Gallaba and Shane McIntosh. 2018. Use and Misuse of Continuous Integration Features: An Empirical Study of Projects That (Mis)Use Travis CI. IEEE Transactions on Software Engineering PP (2018), 1–1. doi:10.1109/TSE.2018.2838131

  13. [13]

    Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap. arXiv:2509.06216 [cs.SE] https://arxiv.org/abs/2509.06216

  14. [14]

    Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig. 2016. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (Singapore, Singapore) (ASE ’16). Association for Computing Machinery, New York, NY, USA, 426–437. doi:10.1145/2970276.2970358

  16. [16]

    SAIL Research. 2026. AIDev Challenge Dataset: AI Teammates in Software Engineering 3.0. https://github.com/SAILResearch/AI_Teammates_in_SE3/blob/main/AIDev_Challenge.pdf. Accessed: 2025-10-04

  17. [17]

    Sakshi Kini and Sana Ansari. 2024. The World’s First AI Software Engineer, DEVIN AI. International Journal of Innovative Research in Technology (2024)

  18. [18]

    Jadson Santos, Daniel Alencar da Costa, and Uirá Kulesza. 2022. Investigating the Impact of Continuous Integration Practices on the Productivity and Quality of Open-Source Projects. In Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ’22). ACM, 137–147. doi:10.1145/3544902.3546244

  19. [19]

    Syed Muhammad Ashhar Shah, Sehrish Habib, Muizz Hussain, Maryam Abdul Ghafoor, and Abdul Ali Bangash. 2025. Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows - Replication Package. doi:10.5281/zenodo.17867301

  20. [20]

    Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, and Sebastian Baltes. 2025. Self-Admitted GenAI Usage in Open-Source Software. arXiv:2507.10422 [cs.SE] https://arxiv.org/abs/2507.10422

  21. [21]

    Alan Yang and T. Andrew Yang. 2024. Social Dangers of Generative Artificial Intelligence: Review and Guidelines. In Proceedings of the 25th Annual International Conference on Digital Government Research (Taipei, Taiwan) (dg.o ’24). Association for Computing Machinery, New York, NY, USA, 654–658. doi:10.1145/3657054.3664243

  22. [22]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793 [cs.SE] https://arxiv.org/abs/2405.15793

  23. [23]

    Zhimin Zhao, Yihao Chen, Abdul Ali Bangash, Bram Adams, and Ahmed E. Hassan. 2024. An empirical study of challenges in machine learning asset management. Empirical Software Engineering 29, 4 (June 2024). doi:10.1007/s10664-024-10474-4