pith. machine review for the scientific record.

arxiv: 2605.10039 · v1 · submitted 2026-05-11 · 💻 cs.SE · cs.CL

Recognition: no theorem link

Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:36 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords coding agents · instruction following · configuration files · factorial design · null results · Claude · software engineering

The pith

Structural choices in coding agent configuration files produce no detectable differences in instruction adherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Practitioners often assume that how they organize configuration files for coding agents—such as their length, the placement of instructions, the overall file architecture, or the presence of conflicting rules—will influence how well the agents follow those instructions. This study conducted a controlled factorial experiment manipulating four such variables across thousands of sessions with frontier models on TypeScript tasks. After correction for multiple tests, none of the structure variables or their interactions showed a reliable effect on compliance with a target instruction. The main observed patterns were lower adherence later in each session and differences across the specific coding tasks assigned.

Core claim

A factorial study of file size, instruction position, file architecture, and adjacent-file conflicts found no detectable effects on compliance rates after multiple-testing correction, with affirmative Bayes-factor support for the null on size and conflict. Compliance odds instead decline with each successive function generated within a session, a pattern that replicated on a second codebase and across models.
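The per-step odds ratio in the core claim can be translated into probabilities; a minimal sketch, assuming an illustrative 90% compliance at position 1 (the baseline is not taken from the paper) and treating OR = 0.944 as if it applied uniformly per step, which the paper cautions it does not:

```python
def compliance_at_position(p1, odds_ratio, k):
    """Compliance probability at generation position k, given the
    position-1 probability p1 and a constant per-step odds ratio.
    Illustrative only: the paper describes the observed pattern as
    non-monotonic, not a constant per-step decline."""
    odds_1 = p1 / (1 - p1)
    odds_k = odds_1 * odds_ratio ** (k - 1)
    return odds_k / (1 + odds_k)

# Assumed 90% compliance at position 1, OR = 0.944 per step
trajectory = [compliance_at_position(0.90, 0.944, k) for k in (1, 5, 10, 25)]
```

Under these assumptions the probability erodes slowly at first (odds, not probabilities, shrink by 5.6% per step), which is why a 25-function session can end noticeably below where it started.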

What carries the argument

The compliance outcome, defined as whether the agent follows a pre-inserted target annotation in its generated functions, analyzed via mixed-effects models that include session sequence as a predictor.
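A stripped-down version of that analysis can be sketched as an ordinary logistic regression of compliance on generation position; the paper's actual models add random effects (e.g. per session), which are omitted here, and the simulated data and coefficients are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate sessions of 25 generated functions whose compliance odds
# shrink by an assumed OR of 0.944 per position (values illustrative).
n_sessions, n_pos = 200, 25
pos = np.tile(np.arange(1, n_pos + 1), n_sessions).astype(float)
b0, b1 = 2.0, np.log(0.944)              # intercept, per-step log-odds slope
p = 1 / (1 + np.exp(-(b0 + b1 * pos)))
y = rng.binomial(1, p).astype(float)

# Plain logistic regression fit by Newton's method; a mixed-effects
# version would add session-level random intercepts on top of this.
X = np.column_stack([np.ones_like(pos), pos])
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))

fitted_or = np.exp(beta[1])              # recovers a value near 0.944
```

Including generation position as a predictor is what lets the model separate within-session decay from the between-condition contrasts the study set out to test.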

If this is right

  • Agents can be configured with simpler or more convenient file structures without measurable loss in basic instruction following.
  • Attention should shift to task design and managing session length rather than file layout.
  • Within-session degradation in adherence is a robust pattern worth addressing directly.
  • Results hold for the tested frontier models and codebases under the study conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The trivial nature of the target annotation may mask effects that would appear with more complex or cumulative instructions typical in real projects.
  • Non-monotonic within-session patterns hint at context accumulation or attention decay mechanisms inside the agent.
  • Extending the design to multi-file, evolving codebases could reveal interactions not visible in the isolated tasks used here.

Load-bearing premise

That success at obeying one isolated, trivial annotation in these controlled tasks accurately reflects how agents would adhere to detailed, interconnected instructions in actual software development.

What would settle it

Running the same structural variations but measuring adherence to a battery of realistic project rules instead of a single trivial annotation, and finding a large effect size.

Figures

Figures reproduced from arXiv: 2605.10039 by Damon McMillan.

Figure 2
Figure 2: Independent variable effects. Function-level ICR with Wilson 95% confidence intervals at each level of the four manipulated variables: size (top-left), position (top-right), architecture (bottom-left), and conflict (bottom-right). No contrast survives BH correction. Reference cell ME-03 (S3 / P1 / A1 / C0) is marked.
Figure 3
Figure 3: Within-session attenuation slope reproduces in the same direction across the three CLI-matched cells. Forest plot of within-session attenuation slopes (log-odds per generation position) for the three CLI-matched cells, with 95% Wald confidence intervals shown as horizontal whiskers. All three slopes are negative and the confidence intervals overlap.
Figure 1
Figure 1: Within-session compliance attenuation on generation order, plotted across positions 1 to 25 (the full well-powered range plus a small slice of the noisy tail at high positions where sample sizes thin). Panel (a): pooled instruction compliance rate (ICR) by generation position within a session, with the Wilson 95% confidence ribbon and observed rates (circles, n ≥ 50 per position).
Figure 5
Figure 5: First-omission dynamics on generation order. Panel (a): Kaplan-Meier-style survival of "still fully compliant" by generation position within a session; the median-survival position is marked with a dashed line. Panel (b): histogram of first-omission generation positions among the 756 affected runs; the distribution centres on a median of 4.
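The Wilson 95% intervals shown in the figures can be computed directly from a compliance count; a minimal sketch:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default),
    of the kind used for the per-condition ICR error bars."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# e.g. 45 compliant functions out of 50 observations at one level
lo, hi = wilson_interval(45, 50)
```

Unlike the naive normal interval, the Wilson interval stays inside [0, 1] and behaves sensibly at the near-ceiling compliance rates this study observes.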
read the original abstract

Frontier coding agents read configuration files (CLAUDE.md, AGENTS.md, Cursor Rules) at session start and are expected to follow the conventions inside them. Practitioners assume that structural choices (file size, instruction position, file architecture, contradictions in adjacent files) measurably affect adherence. We report a systematic factorial study of these choices using four manipulated variables, measuring compliance with a trivial target annotation across 1,650 Claude Code CLI sessions (16,050 function-level observations) on two TypeScript codebases, three frontier models (primarily Sonnet 4.6, with Opus 4.6 as a CLI-matched cross-model check and Opus 4.7 reported descriptively under a CLI-version confound), and five coding tasks. We use mixed-effects models with a Bayesian companion. None of the four structural variables or three two-way interactions produces a detectable contrast after multiple-testing correction. Size and conflict nulls are supported by affirmative-null Bayes factors (BF10 between 0.05 and 0.10); position and architecture nulls are failures to reject without Bayes-factor support. The largest effect we measured is within-session: each additional function the agent generates is associated with approximately 5.6% lower odds of compliance per step (OR = 0.944) within the session-length range we tested, though the relationship is non-monotonic rather than a constant per-step effect. This reproduces on a second TypeScript codebase and on Opus 4.6 at matched configuration; it was identified during analysis rather than pre-specified. Within the conditions tested, file-structure variables did not produce detectable contrasts; compliance varies systematically between coding tasks and across each session's sequence of generated functions.
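The multiple-testing correction invoked throughout is Benjamini-Hochberg (per the figure captions); a minimal sketch of the adjusted-p-value form, with invented p-values standing in for the paper's seven structural contrasts:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: rank the raw p-values,
    scale each by m/rank, then enforce monotonicity from the largest
    p-value down so adjusted values never decrease with rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Seven invented raw p-values, one per structural contrast
adjusted = bh_adjust([0.04, 0.20, 0.60, 0.33, 0.08, 0.91, 0.47])
```

A contrast "survives BH correction" when its adjusted p-value falls below the chosen false-discovery-rate threshold; in the paper, none of the seven does.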

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript reports a factorial study of four file-structure variables (size, position, architecture, and conflict) on coding agents' adherence to configuration files, using compliance with one trivial target annotation as the outcome. Across 1,650 sessions and 16,050 observations with frontier models on two TypeScript codebases, none of the structural variables or their two-way interactions yield detectable effects after multiple-testing correction; size and conflict nulls receive affirmative Bayes-factor support. The largest observed effect is a post-hoc within-session decay (OR=0.944), which replicates across codebases and a matched model; all findings are qualified to the simple tasks and annotation tested.

Significance. If the results hold, the work supplies large-scale empirical evidence that practitioner assumptions about structural choices in agent config files may not translate to measurable adherence differences, at least under the conditions examined, while underscoring session-level dynamics. Credit is due for the substantial sample, mixed-effects modeling, Bayesian companion analysis, and explicit replication on a second codebase and model.

major comments (2)
  1. [Abstract and Results] The central null claims for the four structural variables rest on a proxy (binary compliance with a single pre-specified trivial annotation) whose sensitivity to adherence differences under realistic, multi-rule instructions is untested. Although the proxy detects within-session decay, this does not establish that it would register structural effects if present; the manuscript's qualification to 'simple' tasks leaves the load-bearing interpretation of the nulls open to the concern that the measure may be insensitive rather than that the variables truly have no effect.
  2. [Results] The within-session decay (OR=0.944 per additional function) is identified as the largest effect and is described as non-monotonic, yet it was discovered post-hoc rather than pre-specified. This exploratory status requires stronger caveats in the results and discussion, including explicit discussion of how the post-hoc identification and the non-monotonic pattern affect the reliability and generalizability of the finding relative to the pre-planned structural contrasts.
minor comments (3)
  1. [Methods] Clarify the precise operational definition of the trivial target annotation, the five coding tasks, and how compliance was coded at the function level to permit readers to judge the proxy's scope.
  2. [Methods] The handling of the CLI-version confound for Opus 4.7 (reported only descriptively) should be explained in more detail, including why it precludes a full cross-model comparison.
  3. [Discussion] Add a brief statement in the discussion on the extent to which the null structural findings can be expected to generalize to complex, multi-rule instructions typical of real development workflows.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the insightful comments, which help clarify the scope and limitations of our findings. We address the two major comments below. In response, we will revise the manuscript to include stronger caveats regarding the proxy measure's sensitivity and the exploratory nature of the within-session decay effect.

read point-by-point responses
  1. Referee: [Abstract and Results] The central null claims for the four structural variables rest on a proxy (binary compliance with a single pre-specified trivial annotation) whose sensitivity to adherence differences under realistic, multi-rule instructions is untested. Although the proxy detects within-session decay, this does not establish that it would register structural effects if present; the manuscript's qualification to 'simple' tasks leaves the load-bearing interpretation of the nulls open to the concern that the measure may be insensitive rather than that the variables truly have no effect.

    Authors: We chose the single trivial annotation as the outcome measure to ensure it was objective, pre-specifiable, and independent of the coding task difficulty, allowing us to isolate the effects of the file-structure variables. The detection of the within-session decay demonstrates that the measure is capable of registering adherence changes under the tested conditions. We acknowledge, however, that its sensitivity to differences in adherence under more complex, multi-rule instructions has not been directly tested. The manuscript already qualifies the findings to the simple tasks and annotation used. To address the concern, we will expand the Discussion section to more explicitly discuss the potential for the proxy to be insensitive to structural effects in richer instruction settings and to frame the null results accordingly as applying to this specific adherence metric. revision: yes

  2. Referee: [Results] The within-session decay (OR=0.944 per additional function) is identified as the largest effect and is described as non-monotonic, yet it was discovered post-hoc rather than pre-specified. This exploratory status requires stronger caveats in the results and discussion, including explicit discussion of how the post-hoc identification and the non-monotonic pattern affect the reliability and generalizability of the finding relative to the pre-planned structural contrasts.

    Authors: The within-session decay was indeed identified post-hoc during the analysis phase rather than as a pre-registered hypothesis. The abstract already states that 'it was identified during analysis rather than pre-specified,' but we agree that additional caveats are warranted in the Results and Discussion. We will revise these sections to include a dedicated paragraph on the exploratory status of this finding, discuss how the non-monotonic pattern (rather than a strictly linear decay) influences interpretation, and compare its reliability and generalizability to the pre-planned structural variable contrasts. This will help readers appropriately weight the finding. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical factorial study with statistical modeling of observed data

full rationale

The paper conducts a controlled factorial experiment across 1,650 sessions and 16,050 observations, fitting mixed-effects models and reporting Bayes factors on the collected compliance data. No derivations, equations, or predictions are claimed; the central null findings on structural variables are direct statistical outcomes of the experiment rather than reductions to fitted inputs or self-citations. The within-session decay effect was identified post-hoc but is presented as an empirical observation, not a self-referential prediction. No load-bearing self-citations or ansatzes appear in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central null claims rest on standard statistical modeling assumptions and the validity of the chosen compliance measure rather than new theoretical constructs or free parameters.

axioms (2)
  • standard math Assumptions of mixed-effects logistic regression hold (e.g., conditional independence given random effects)
    Invoked for analysis of compliance across sessions and tasks.
  • domain assumption The trivial target annotation serves as a valid measure of instruction adherence
    Central to the outcome variable across all 16,050 observations.

pith-pipeline@v0.9.0 · 5610 in / 1339 out tokens · 60401 ms · 2026-05-12T02:36:56.130399+00:00 · methodology

discussion (0)

