arxiv: 2606.13449 · v1 · pith:FOTFLI5Xnew · submitted 2026-06-11 · 💻 cs.SE · cs.AI

Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests

Pith reviewed 2026-06-27 06:01 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI agents · instruction files · pull requests · merge rate · agentic PRs · Instructions-as-Code · software engineering · code churn

The pith

Creating instruction files for AI-agents yields mixed effects on pull request success across projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies 15,549 agentic pull requests across 148 projects to test whether instruction files that guide AI-agents improve merge rates, code churn, and merge effort. It performs within-project before-and-after comparisons around the moment each project introduced such files. Roughly equal shares of projects (27.7 percent and 26.35 percent) experienced at least a 20 percent rise or fall in merge rate after the files appeared, with parallel mixed patterns for change volume and review time. Projects showing gains tended to have longer files divided into more sections. The results indicate that adding instructions alone does not reliably raise agent performance and therefore motivate systematic engineering of the files themselves.

Core claim

Specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes and with the efforts to merge an agentic PR. Projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections.

What carries the argument

Within-project before-and-after comparison of three PR metrics (merge rate, code churn and modified files, merge time and comment count) around the creation of instruction files.
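
The comparison can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `PullRequest` record, the `classify_project` helper, and the reading of "at least 20%" as a relative change in merge rate are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PullRequest:
    # Hypothetical record; field names are assumptions, not the AIDev schema.
    created_at: datetime
    merged: bool


def merge_rate(prs):
    """Share of PRs that were merged, or None when there are no PRs."""
    return sum(pr.merged for pr in prs) / len(prs) if prs else None


def classify_project(prs, instructions_added_at, threshold=0.20):
    """Label one project by the relative change in merge rate around the cutoff."""
    before = [pr for pr in prs if pr.created_at < instructions_added_at]
    after = [pr for pr in prs if pr.created_at >= instructions_added_at]
    rate_before, rate_after = merge_rate(before), merge_rate(after)
    # A relative change needs PRs on both sides and a non-zero baseline.
    if rate_before is None or rate_after is None or rate_before == 0:
        return "undefined"
    change = (rate_after - rate_before) / rate_before
    if change >= threshold:
        return "improved"
    if change <= -threshold:
        return "declined"
    return "stable"
```

Running such a function per project and tallying the labels would give shares comparable to those reported. How the paper treats projects with no agentic PRs on one side of the cutoff is not described in the text reproduced here, so the "undefined" label is a placeholder.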

If this is right

  • Roughly one-quarter of projects can expect a meaningful rise in merge rate after adding instruction files, while a similar share can expect a decline.
  • Positive outcomes associate with longer files that contain more sections and subsections.
  • The mixed results across all three performance dimensions motivate treating instruction-file creation as an explicit software-engineering activity called Instructions-as-Code.
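
The "roughly one-quarter" shares can be turned into project counts with simple arithmetic. The sketch below assumes both percentages are shares of all 148 projects; the counts are derived here by rounding and are not figures quoted from the paper.

```python
TOTAL_PROJECTS = 148  # from the abstract


def implied_count(share_percent, total=TOTAL_PROJECTS):
    """Nearest whole number of projects matching a reported percentage."""
    return round(share_percent / 100 * total)


improved = implied_count(27.7)    # 41 projects
declined = implied_count(26.35)   # 39 projects
# Projects whose merge rate moved by less than 20% in either direction,
# assuming the two groups above do not overlap.
neither = TOTAL_PROJECTS - improved - declined  # 68 projects

# Sanity check: the counts round back to the reported percentages.
assert round(100 * improved / TOTAL_PROJECTS, 1) == 27.7
assert round(100 * declined / TOTAL_PROJECTS, 2) == 26.35
```

Under that assumption, the two groups differ by only two projects, and close to half of the projects show no change of 20% or more.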

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Projects may need to treat instruction files as living artifacts that receive updates and reviews like source code to capture the gains seen in the better-structured cases.
  • The length-and-structure observation suggests that simple presence of a file is insufficient; future work could test whether automated checks for file structure predict later PR improvements.
  • Neighboring problems such as prompt engineering for other agent tasks may benefit from the same before-after measurement approach used here.
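
A structure check of the kind suggested above could start from simple counts. The sketch below assumes instruction files are Markdown with `#`-style headings; `instruction_file_stats` is a hypothetical helper, and how the paper itself counts sections and sub-sections is not stated in this summary.

```python
import re
from collections import Counter

HEADING = re.compile(r"^(#{1,6})\s+\S")  # ATX heading: one to six '#' then text
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def instruction_file_stats(text):
    """Return length and heading counts for one Markdown instruction file."""
    lines = text.splitlines()
    headings = Counter()
    in_fence = False
    for line in lines:
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip '#' comments inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings[len(match.group(1))] += 1
    return {
        "lines": len(lines),
        "words": len(text.split()),
        "headings_per_level": dict(headings),
    }
```

Reporting counts per heading level avoids fixing in advance which levels count as sections and which as sub-sections.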

Load-bearing premise

The before-and-after comparison within each project isolates the causal effect of instruction-file creation on the three PR metrics, with no other concurrent project changes confounding the observed differences.

What would settle it

A study that re-runs the before-after analysis after explicitly logging and controlling for any simultaneous changes in team size, review processes, or tooling would falsify the claim if the mixed outcome pattern disappears.

Figures

Figures reproduced from arXiv: 2606.13449 by Ali Arabat, Mohammed Sayagh.

Figure 1: The # of projects across merge rate improvement
Figure 2: The # of projects across merge rate improvement
Figure 5: Characteristics of instruction files for projects with
Original abstract

AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (aka, \textbf{Instructions-as-Code}).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper analyzes 15,549 agentic PRs from 148 projects in the AIDev dataset to assess the impact of instruction files on AI-agent PR performance. It compares merge rate, code churn, number of modified files, merge time, and number of comments before vs. after instruction-file creation within each project, reporting mixed results (e.g., 27.7% of projects show a merge-rate increase of at least 20%, while 26.35% show a decrease). An exploratory analysis links longer, better-structured files to improvements and motivates treating instruction development as 'Instructions-as-Code'.

Significance. If the before-after design can be shown to isolate the effect of instruction files, the mixed outcomes would be a useful empirical signal for AI-assisted software engineering, underscoring that such files are not a guaranteed win and highlighting the value of structured instructions. The scale of the AIDev dataset is a positive feature.

major comments (3)
  1. [Abstract] Abstract: the central mixed-result claim rests on raw percentages (27.7% increase, 26.35% decrease) with no statistical tests, confidence intervals, or description of project-selection criteria for the 148 projects or definition of before/after windows.
  2. [Abstract] Abstract / comparison design: the within-project before-after comparison does not discuss or adjust for concurrent events (team-size changes, process updates, new tooling, or AI-agent adoption itself) that could confound the observed differences in merge rate and other metrics.
  3. [Exploratory analysis] Exploratory analysis: the reported association between longer, multi-section instruction files and merge-rate gains is presented without formal controls, regression, or tests for project-level covariates, limiting its ability to support the 'Instructions-as-Code' recommendation.
minor comments (1)
  1. [Abstract] Abstract: the LaTeX markup '\textbf{Instructions-as-Code}' should be replaced with plain text or proper formatting for readability in non-LaTeX outputs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the value of the AIDev dataset scale and the mixed-results signal. We address each major comment below and outline planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central mixed-result claim rests on raw percentages (27.7% increase, 26.35% decrease) with no statistical tests, confidence intervals, or description of project-selection criteria for the 148 projects or definition of before/after windows.

    Authors: We agree the abstract would be improved by statistical support and clearer methodology. In revision we will add paired statistical tests (e.g., Wilcoxon signed-rank) with confidence intervals for the before-after differences, summarize the selection criteria (all 148 AIDev projects that added instruction files), and define before/after windows by the commit timestamp of the instruction file. These details exist in the methods but will be condensed for the abstract. revision: yes

  2. Referee: [Abstract] Abstract / comparison design: the within-project before-after comparison does not discuss or adjust for concurrent events (team-size changes, process updates, new tooling, or AI-agent adoption itself) that could confound the observed differences in merge rate and other metrics.

    Authors: This is a fair observation about observational before-after designs. Within-project comparison mitigates some project-level heterogeneity, yet time-varying confounders remain possible. We will expand the limitations and threats-to-validity section to explicitly discuss these issues (team changes, tooling, adoption timing) and emphasize that results are associative. Additional data would be required for full adjustment; we will note this limitation clearly. revision: yes

  3. Referee: [Exploratory analysis] Exploratory analysis: the reported association between longer, multi-section instruction files and merge-rate gains is presented without formal controls, regression, or tests for project-level covariates, limiting its ability to support the 'Instructions-as-Code' recommendation.

    Authors: We concur that the current exploratory analysis lacks formal controls. The manuscript already labels it a 'first exploration' and frames Instructions-as-Code as a research motivation rather than a validated claim. In revision we will add a regression model controlling for available project covariates (size, age, contributor count) or, if data limitations prevent this, strengthen the caveats around the association. This will better ground the forward-looking recommendation. revision: partial
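
The paired test named in the first response can be illustrated with a short sketch. This is not the authors' analysis: it is a plain-Python Wilcoxon signed-rank test using the large-sample normal approximation, with no tie or continuity correction, and the function names are hypothetical. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be preferred.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (w_plus, w_minus, two_sided_p) for paired per-project metrics."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation to the null distribution of the smaller rank sum.
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p = min(1.0, 2 * NormalDist().cdf(z))
    return w_plus, w_minus, p
```

Each input list would hold one value per project (for example, merge rate before and after the instruction file), so the test asks whether the per-project differences are centered on zero. The approximation is only reasonable for samples of roughly the size discussed here, not for a handful of projects.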

Circularity Check

0 steps flagged

No circularity: purely observational before-after comparison

Full rationale

The paper reports an empirical observational study that compares three PR metrics (merge rate, code churn, merge effort) within each of 148 projects before versus after instruction-file creation. No equations, fitted parameters, predictions, ansatzes, or derivation chains appear in the provided text or abstract. The central claim rests on direct data aggregation and percentage splits (27.7% increase, 26.35% decrease) rather than any self-referential modeling or self-citation load-bearing step. This is the normal case of a self-contained empirical analysis with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the AIDev dataset correctly timestamps instruction-file creation and that no other project-level changes coincide with that event; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The AIDev dataset accurately records the timing of instruction-file creation and the associated agentic PR metrics for the 148 projects.
    All before-after comparisons depend on this dataset property.


