
arxiv: 2606.12231 · v1 · pith:2RQAS2SXnew · submitted 2026-06-10 · 💻 cs.SE · cs.AI

Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study

Pith reviewed 2026-06-27 09:04 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: AI IDE, rule taxonomy, rule evolution, artifact compliance, empirical study, prompt engineering

The pith

Updating rules in AI IDEs raises average artifact compliance from 49% to 72%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study mines 83 open-source repositories containing 7,310 rules for AI IDEs and surveys 99 practitioners to create a taxonomy of 5 primary and 25 secondary rule categories. It finds a mismatch: repositories favor low-level workflow and formatting rules, while practitioners prioritize architectural constraints. Analysis of 1,540 evolution events shows that rules change often. Repository data attribute most changes to context expansions and enrichments, whereas survey respondents report changing rules mainly to fix AI errors. The central result is that updates to rules measurably improve how well generated software artifacts follow project constraints.

Core claim

An assessment of 160 rule evolution events shows that updating rules improves adherence of software artifacts, raising the average compliance rate by 22.99 percentage points from 49.14% to 72.13%.
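
The headline figure is easy to misread as a relative gain, so the arithmetic is worth spelling out. The following is a minimal sketch that uses only the two averages reported above; the variable names are illustrative and not taken from the paper. It separates the absolute change (in percentage points) from the relative change (in percent of the pre-update average).

```python
# Average artifact compliance rates reported in the paper (in percent).
before = 49.14
after = 72.13

# Absolute change, expressed in percentage points.
delta_pp = round(after - before, 2)

# Relative change with respect to the pre-update average, in percent.
relative_pct = round((after - before) / before * 100, 1)

print(delta_pp)      # 22.99
print(relative_pct)  # 46.8
```

Read this way, the 22.99 figure is a difference between two rates, while the relative improvement over the pre-update baseline is roughly 46.8%.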

What carries the argument

The artifact compliance assessment that measures software artifact adherence before and after each of 160 rule evolution events.
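
To make the before-and-after measurement concrete, here is a minimal sketch under a stated assumption: compliance is taken to be the fraction of rules that an artifact snapshot satisfies, which is the reading suggested in the referee report further down and may not match the paper's exact operational definition. The function names and the toy judgments are hypothetical.

```python
from typing import List, Sequence, Tuple


def compliance_rate(satisfied: Sequence[bool]) -> float:
    """Percentage of rules judged satisfied by one artifact snapshot."""
    if not satisfied:
        raise ValueError("need at least one rule judgment")
    return 100.0 * sum(satisfied) / len(satisfied)


def mean_delta(events: List[Tuple[Sequence[bool], Sequence[bool]]]) -> float:
    """Mean change in compliance (percentage points) across events.

    Each event is a (before, after) pair of per-rule judgments.
    """
    deltas = [compliance_rate(a) - compliance_rate(b) for b, a in events]
    return sum(deltas) / len(deltas)


# Hypothetical toy judgments, NOT data from the paper.
toy_events = [
    ([True, False, False, True], [True, True, False, True]),  # 50% -> 75%
    ([False, False], [True, False]),                            # 0% -> 50%
]
print(mean_delta(toy_events))  # 37.5
```

Averaging per-event deltas, as done here, is one possible aggregation; pooling all rule judgments before computing a rate would be another, and the two can give different numbers.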

If this is right

  • Rule files in practice contain mostly low-level constraints even though developers rate architectural constraints as more important.
  • Rule evolution occurs frequently and is driven by context expansions and enrichments according to repository data.
  • Practitioners mainly add new negative constraints to correct AI errors rather than editing existing rules.
  • The measured compliance gains suggest that maintaining and updating rules can serve as an effective prompting strategy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated tools could monitor rule files for opportunities to trigger updates that would most improve compliance.
  • Taxonomy categories might be used to detect conflicts between rules from different sources or team members.
  • The gap between stated priorities and actual rule content points to a need for better default rule templates in AI IDEs.

Load-bearing premise

The manual or semi-automated classification of 7,310 rules into the 5 primary and 25 secondary categories is accurate and consistent, and the 83 mined repositories plus 99 survey respondents are representative of typical AI IDE usage.

What would settle it

A replication that applies the same before-and-after compliance measurement to 160 new rule evolution events and finds no statistically significant rise in adherence rates.
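
One simple way such a replication could test for a rise is a paired sign test on per-event compliance rates. The sketch below is an illustration only: the paper's own statistical procedure is not stated on this page, the sign test is swapped in because it needs nothing beyond the standard library, and the numbers are hypothetical.

```python
from math import comb


def sign_test_p(before, after):
    """Two-sided exact sign test on paired values (ties are dropped)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in diffs)
    # Probability of a split at least this lopsided under H0 (p = 0.5).
    tail = sum(comb(n, i) for i in range(0, min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical compliance rates (percent) for 8 events, NOT from the paper.
before = [40, 55, 30, 60, 45, 50, 70, 35]
after = [65, 70, 55, 60, 80, 75, 65, 60]
print(sign_test_p(before, after))  # 0.125
```

With only eight toy events, six improvements against one decline is not significant at the 0.05 level; a replication on 160 events would have far more power to detect or rule out a rise.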

Figures

Figures reproduced from arXiv: 2606.12231 by Guangzong Cai, Mojtaba Shahin, Peng Liang, Ruiyin Li, Zengyang Li.

Figure 1: An example of a rule file defining coding standards.
Figure 2: Overview of the mixed-methods research process.
Figure 3: An example of segmenting rule text into distinct semantic units (rules).
Figure 4: An example of extracting changes from rule Diffs.
Figure 5: The prompt structure used for LLM-based reason determinability filtering.
Figure 6: Overview of the rule evolution analysis and compliance assessment process.
Figure 7: The prompt structure used for LLM-based verifiability filtering.
Figure 8: The prompt structure used for rule compliance assessment.
Figure 9: Overview of the survey questionnaire.
Figure 10: Distribution of primary languages and project domains in the selected projects.
Figure 11: Distribution of commit count, number of contributors, and project creation month in …
Figure 12: Overview of the number of rule files and rule entries in each project.
Figure 13: Overview of countries of survey participants.
Figure 14: Overview of education, professional experience and rule file usage duration of survey …
Figure 15: Overview of roles in the team, project domains developed using the AI IDE, and AI …
Figure 16: Rule count and importance mean score across secondary categories.
Figure 17: Change rate and change type distribution of AI IDE rules.
Figure 18: Triangulation of driving reasons for rule evolution: mining data vs. survey responses.
Figure 19: Co-occurrence heatmap of subjective driving reasons based on Phi coefficients.
Figure 20: Longitudinal trends of artifact compliance rates across 5 commits before and after rule …
Figure 21: Changes in average compliance rates for selected secondary categories before and …

Original abstract

The adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced "Rules" as a novel software artifact, allowing developers to persistently inject project-specific constraints and architectural guidelines into the context of Large Language Models (LLMs). Despite their role in aligning AI behavior with developer intent, the taxonomy, evolution, and practical impact of these rules remain largely unexplored. To bridge this gap, we conducted a mixed-methods empirical study on AI IDE rules. By mining 83 open-source projects and extracting 7,310 rules, we established a comprehensive taxonomy comprising 5 primary and 25 secondary categories. We then triangulated these artifacts with survey responses from 99 practitioners. Our analysis identified a contrast between developer priorities and actual configurations: while practitioners rate architectural constraints as highly important, rule files in repositories primarily consist of low-level workflow and code formatting constraints. Furthermore, our analysis of 1,540 rule evolution events revealed that rules are updated frequently. Repository data further indicate that rule evolution is primarily driven by constructive context expansions (29.17%) and enrichments (26.59%). In contrast, surveyed developers reported modifying rules primarily to correct AI errors (77.78%), typically by adding new negative constraints rather than editing existing ones. Finally, an artifact compliance assessment of 160 rule evolution events revealed that updating rules significantly improves the adherence of software artifacts, with the average artifact compliance rate increasing by 22.99% (from 49.14% to 72.13%) following an update. Our study provides empirical insights that can help developers optimize prompting strategies and guide tool builders in designing automated conflict-detection and context-management mechanisms for AI IDEs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports a mixed-methods empirical study of rules in AI IDEs. It mines 83 open-source repositories to extract 7,310 rules, derives a taxonomy of 5 primary and 25 secondary categories, triangulates with a survey of 99 practitioners, analyzes 1,540 rule-evolution events, and performs an artifact-compliance assessment on 160 events. Key findings include a mismatch between the practitioner-rated importance of architectural constraints and the prevalence of low-level workflow rules, frequent rule updates driven by context expansion, and an average increase of 22.99 percentage points in artifact compliance (49.14% to 72.13%) after rule updates.

Significance. If the compliance assessment is reproducible, the reported 22.99-percentage-point improvement would supply concrete evidence that rule evolution measurably affects artifact adherence in AI IDEs, informing both developer prompting practices and tool design for conflict detection. The taxonomy and the contrast between stated priorities and actual configurations also offer a baseline for future longitudinal studies of rule maintenance.

major comments (2)
  1. [artifact compliance assessment] The artifact compliance assessment (abstract and § on compliance results): the selection protocol for the 160 events out of 1,540 evolution events is not described, nor is the operational definition of the compliance metric (e.g., fraction of rules satisfied by concrete code artifacts) or any inter-rater reliability statistics. Without these, the 22.99% delta cannot be attributed to rule updates rather than measurement artifact.
  2. [taxonomy and mining sections] Taxonomy construction and repository sampling (§ on mining and taxonomy): the paper supplies no inter-rater agreement figures for the manual or semi-automated classification of 7,310 rules into the 5+25 categories, nor explicit inclusion/exclusion criteria or sampling frame for the 83 projects. These omissions directly affect the reliability of the reported category distributions and the contrast with survey priorities.
minor comments (2)
  1. [survey section] The survey response rate and any statistical tests for the 99 responses are not reported, which would help assess representativeness.
  2. [evolution events analysis] Clarify whether the 1,540 evolution events were exhaustively extracted or sampled, and provide the exact definition of 'constructive context expansions' versus 'enrichments'.
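
The inter-rater statistics requested in the comments above are commonly reported as Cohen's kappa when two raters label the same items. The following is a minimal sketch of that statistic, assuming two raters and nominal labels; the category names and labels are placeholders, not taken from the paper's taxonomy or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels; the category names are placeholders.
rater_1 = ["workflow", "workflow", "format", "arch", "format", "arch"]
rater_2 = ["workflow", "format", "format", "arch", "format", "workflow"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.5
```

In this toy case the raters agree on 4 of 6 items, chance agreement is 1/3, and kappa comes out at 0.5. With more than two raters, a statistic such as Fleiss' kappa would be the usual substitute.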

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate the requested clarifications to enhance the manuscript's methodological transparency and reproducibility.

Point-by-point responses
  1. Referee: [artifact compliance assessment] The artifact compliance assessment (abstract and § on compliance results): the selection protocol for the 160 events out of 1,540 evolution events is not described, nor is the operational definition of the compliance metric (e.g., fraction of rules satisfied by concrete code artifacts) or any inter-rater reliability statistics. Without these, the 22.99% delta cannot be attributed to rule updates rather than measurement artifact.

    Authors: We agree that the selection protocol, operational definition of the compliance metric, and inter-rater reliability statistics were omitted from the original submission. In the revised manuscript we will describe the protocol used to select the 160 events from the 1,540 evolution events, provide the precise definition of the compliance metric as the fraction of rules satisfied by the concrete code artifacts, and report inter-rater reliability statistics for the compliance judgments. These additions will allow readers to evaluate the reliability of the reported 22.99% improvement.
    Revision: yes

  2. Referee: [taxonomy and mining sections] Taxonomy construction and repository sampling (§ on mining and taxonomy): the paper supplies no inter-rater agreement figures for the manual or semi-automated classification of 7,310 rules into the 5+25 categories, nor explicit inclusion/exclusion criteria or sampling frame for the 83 projects. These omissions directly affect the reliability of the reported category distributions and the contrast with survey priorities.

    Authors: We concur that inter-rater agreement figures and explicit sampling details should have been reported. The revised manuscript will include inter-rater agreement statistics for the classification of the 7,310 rules into the taxonomy categories, together with the inclusion/exclusion criteria and sampling frame applied to select the 83 projects. These additions will strengthen the credibility of the category distributions and the observed contrast with practitioner priorities.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical study

full rationale

The paper reports a mixed-methods study involving repository mining of 7,310 rules from 83 projects, manual taxonomy construction into 5 primary and 25 secondary categories, a survey of 99 practitioners, analysis of 1,540 evolution events, and a compliance assessment on 160 events. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. Claims rest on direct observation and triangulation rather than any reduction of outputs to inputs by definition or self-citation. The compliance delta is presented as a measured empirical outcome, not derived from prior fitted values or author self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no mathematical free parameters, axioms, or invented entities are introduced beyond standard assumptions of representative sampling and accurate manual classification.

pith-pipeline@v0.9.1-grok · 5843 in / 1108 out tokens · 21664 ms · 2026-06-27T09:04:20.965197+00:00 · methodology

