
arxiv: 2606.15828 · v4 · pith:A5ZWIMT5new · submitted 2026-06-14 · 💻 cs.SE

Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents

Pith reviewed 2026-06-27 03:43 UTC · model grok-4.3

classification 💻 cs.SE
keywords: configuration smells · coding agents · AGENTS.md · CLAUDE.md · repository mining · grey literature · software configuration

The pith

Six configuration smells appear in AGENTS.md files that guide coding agents, detectable by automated heuristics and present in most examined repositories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs six configuration smells in files that instruct coding agents on architecture, workflows, and practices. It derives the smells from a grey literature review and repository mining, then builds automated heuristics to find them. When the heuristics run on 100 popular open-source repositories, they reveal that smells are common, with Lint Leakage in 62 percent of files, Context Bloat in 42 percent, and Skill Leakage in 35 percent. Several smells also tend to appear together. If correct, the work shows that many agent configurations contain avoidable issues that could affect how the agents perform software engineering tasks.

Core claim

We identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

What carries the argument

The six configuration smells together with the automated heuristics that detect them from the content of AGENTS.md and CLAUDE.md files.

If this is right

  • Configuration smells occur in the large majority of the 100 repositories studied.
  • Lint Leakage, Context Bloat, and Skill Leakage are the three most frequent smells.
  • Context Bloat, Skill Leakage, and Conflicting Instructions often appear in the same files.
  • The proposed heuristics can locate these smells without manual inspection of each file.
  • The smells can be catalogued and measured at scale using the methods described.
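The prevalence and co-occurrence claims above can be made concrete with a small sketch. This page does not describe how the paper measured co-occurrence, so support and lift are used here as an assumption, and the `detections` mapping is hypothetical data, not the paper's results.

```python
from itertools import combinations

# Hypothetical detection results: file name -> set of smells flagged in it.
detections = {
    "repo_a/AGENTS.md": {"Context Bloat", "Skill Leakage", "Lint Leakage"},
    "repo_b/CLAUDE.md": {"Lint Leakage"},
    "repo_c/AGENTS.md": {"Context Bloat", "Skill Leakage", "Conflicting Instructions"},
    "repo_d/AGENTS.md": set(),
}


def prevalence(detections):
    """Fraction of files in which each smell was flagged."""
    n = len(detections)
    counts = {}
    for smells in detections.values():
        for smell in smells:
            counts[smell] = counts.get(smell, 0) + 1
    return {smell: count / n for smell, count in counts.items()}


def cooccurrence(detections):
    """Support and lift for every unordered pair of smells seen in the same file."""
    n = len(detections)
    rates = prevalence(detections)
    pair_counts = {}
    for smells in detections.values():
        # Sort so each pair has one canonical (alphabetical) key.
        for pair in combinations(sorted(smells), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    result = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        # Lift > 1 means the pair appears together more often than
        # independence between the two smells would predict.
        lift = support / (rates[a] * rates[b])
        result[(a, b)] = {"support": support, "lift": lift}
    return result


if __name__ == "__main__":
    for pair, stats in sorted(cooccurrence(detections).items()):
        print(pair, stats)
```

A lift well above 1 for a pair such as Context Bloat and Skill Leakage would be consistent with the co-occurrence the paper reports, but a four-file toy sample like this one says nothing about the real corpus.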

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams that maintain AGENTS.md files could run the heuristics as part of their regular checks to reduce avoidable agent misbehavior.
  • The co-occurrence pattern suggests that addressing one smell may reduce the chance of others appearing in the same file.
  • Future work could test whether removing these smells produces measurable improvements in the quality of agent-generated code or task completion rates.

Load-bearing premise

The grey literature review and repository mining captured the main configuration problems that actually occur when people use these files with coding agents.

What would settle it

Applying the heuristics to a fresh collection of several hundred repositories and obtaining smell rates below 20 percent for all six types would indicate the reported prevalence does not generalize.
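The test described above reduces to a simple rate check. The sketch below is illustrative only: the function names are hypothetical, the first input reuses the three percentages quoted on this page as counts per 100 files (the other three smells are not quoted here, so they are omitted), and the replication counts are invented.

```python
def smell_rates(flagged_counts, total_files):
    """Convert per-smell counts of flagged files into rates."""
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    return {smell: count / total_files for smell, count in flagged_counts.items()}


def all_below_threshold(flagged_counts, total_files, threshold=0.20):
    """True when every smell's rate is strictly below the threshold."""
    rates = smell_rates(flagged_counts, total_files)
    return all(rate < threshold for rate in rates.values())


# Rates quoted on this page, expressed as counts per 100 files.
reported = {"Lint Leakage": 62, "Context Bloat": 42, "Skill Leakage": 35}

# Invented counts for a hypothetical 300-file replication.
hypothetical_replication = {"Lint Leakage": 45, "Context Bloat": 30, "Skill Leakage": 21}

if __name__ == "__main__":
    print(all_below_threshold(reported, 100))
    print(all_below_threshold(hypothetical_replication, 300))
```

A real replication would also need all six smells and some measure of sampling uncertainty before drawing the conclusion stated above.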

Figures

Figures reproduced from arXiv: 2606.15828 by Helio Victor F. dos Santos, Joao Eduardo Montandon, Luciana Lourdes Silva, Marco Tulio Valente, Vitor Costa.

Figure 1: Example of AGENTS.md file
Figure 2: Overview of the grey literature steps
Figure 3: Search string used in the literature review
Figure 4: Prompt to detect Skill Leakage, Lint Leakage, Blind References and …
Figure 5: Example added to the prompt to detect Blind References
Figure 6: Example of Skill Leakage (quickemu-project/quickemu)
Figure 7: Example of Lint Leakage (google/adk-python)
Figure 8: External reference described with context (browser-use/browser-use)
Figure 9: Blind Reference (SuperClaude-Org/SuperClaude…
Figure 10: Number of changes in …
Figure 11: Number of commits in projects exhibiting …
Figure 12: Example of Conflicting Instructions (inkline/inkline)
Figure 13: Smell accumulation in AGENTS.md files
Original abstract

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.md or CLAUDE.md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents the first catalog of six configuration smells in AGENTS.md and CLAUDE.md files for coding agents. Smells were identified via grey literature review and repository mining; automated heuristics were proposed for detection; and prevalence was measured by applying the heuristics to 100 popular open-source repositories, yielding rates such as 62% for Lint Leakage, 42% for Context Bloat, and 35% for Skill Leakage, with noted co-occurrences.

Significance. If the detection heuristics prove reliable, the work supplies a timely, practical catalog that could guide configuration practices and support future linting tools for agent instructions. The empirical scale (100 repositories) and focus on an emerging SE artifact add relevance, but the absence of any reported validation for the heuristics directly limits the trustworthiness of the prevalence numbers that form the central empirical result.

major comments (1)
  1. [Methods] Methods section (and abstract): the prevalence claims rest entirely on the automated heuristics, yet the manuscript provides no validation step—no manual labeling of a sample, no precision/recall figures, no inter-rater agreement, and no error analysis for any of the six smells. This is load-bearing because the reported percentages (e.g., Lint Leakage at 62%) cannot be interpreted without evidence that the heuristics are accurate rather than inflated by false positives.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this critical methodological gap. We agree that the prevalence results cannot be fully trusted without evidence of heuristic accuracy and will revise the paper accordingly.

Point-by-point responses
  1. Referee: [Methods] Methods section (and abstract): the prevalence claims rest entirely on the automated heuristics, yet the manuscript provides no validation step—no manual labeling of a sample, no precision/recall figures, no inter-rater agreement, and no error analysis for any of the six smells. This is load-bearing because the reported percentages (e.g., Lint Leakage at 62%) cannot be interpreted without evidence that the heuristics are accurate rather than inflated by false positives.

    Authors: We fully agree. The heuristics were constructed from the grey-literature review and initial repository mining but were applied to the 100-repository corpus without any subsequent manual validation or error analysis. In the revised manuscript we will (1) add a dedicated validation subsection describing a stratified random sample of 30 files per smell (or the entire set if smaller), (2) report independent labeling by two authors, (3) compute precision, recall, and Cohen’s kappa, (4) include a concise error analysis for each smell, and (5) update the abstract and methods to reflect these new results. We will also add an explicit threats-to-validity paragraph acknowledging that the original prevalence figures were heuristic-based only.

    Revision: yes
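The validation metrics named in the rebuttal follow standard definitions. The sketch below assumes binary labels (smell present or absent) per file; the label lists are invented for illustration and the function names are hypothetical, not taken from the paper.

```python
def precision_recall(predicted, truth):
    """Precision and recall of heuristic output against manual labels."""
    if len(predicted) != len(truth):
        raise ValueError("label lists must have the same length")
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(not p and t for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # share of items rater A marked as smelly
    p_b = sum(rater_b) / n  # share of items rater B marked as smelly
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: True means the smell is present in the file.
heuristic = [True, True, True, False, False, False]
manual = [True, True, False, True, False, False]
author_1 = [True, True, False, False, True, False, True, False]
author_2 = [True, False, False, False, True, False, True, True]

if __name__ == "__main__":
    print(precision_recall(heuristic, manual))
    print(cohens_kappa(author_1, author_2))
```

Precision bears most directly on the referee's concern, since it estimates how many flagged files actually contain the smell.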

Circularity Check

0 steps flagged

Empirical prevalence study with no circular derivations

Full rationale

The paper identifies smells via grey-literature review plus repository mining, defines heuristics from those steps, then measures prevalence by applying the heuristics to 100 separate repositories. No equations, no fitted parameters renamed as predictions, no self-citation chains, and no self-definitional loops appear in the derivation. Prevalence percentages are direct counts from external data under the stated heuristics; the chain is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that grey literature and repository mining yield a complete and actionable set of smells, plus the implicit assumption that the 100-repo sample is representative of broader usage. No free parameters or invented entities beyond the defined smells themselves.

axioms (1)
  • Domain assumption: grey literature review combined with repository mining can identify the relevant configuration smells for coding agents.
    This premise is used to derive the six smells and the detection heuristics.

pith-pipeline@v0.9.1-grok · 5743 in / 1263 out tokens · 47530 ms · 2026-06-27T03:43:37.917899+00:00 · methodology

