Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents
Pith reviewed 2026-06-27 03:43 UTC · model grok-4.3
The pith
Six configuration smells appear in AGENTS.md files that guide coding agents, detectable by automated heuristics and present in most examined repositories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.
What carries the argument
The six configuration smells together with the automated heuristics that detect them from the content of AGENTS.md and CLAUDE.md files.
If this is right
- Configuration smells occur in the large majority of the 100 repositories studied.
- Lint Leakage, Context Bloat, and Skill Leakage are the three most frequent smells.
- Context Bloat, Skill Leakage, and Conflicting Instructions often appear in the same files.
- The proposed heuristics can locate these smells without manual inspection of each file.
- The smells can be catalogued and measured at scale using the methods described.
Where Pith is reading between the lines
- Teams that maintain AGENTS.md files could run the heuristics as part of their regular checks to reduce avoidable agent misbehavior.
- The co-occurrence pattern suggests that addressing one smell may reduce the chance of others appearing in the same file.
- Future work could test whether removing these smells produces measurable improvements in the quality of agent-generated code or task completion rates.
Load-bearing premise
The grey literature review and repository mining captured the main configuration problems that actually occur when people use these files with coding agents.
What would settle it
Applying the heuristics to a fresh collection of several hundred repositories and obtaining smell rates below 20 percent for all six types would indicate the reported prevalence does not generalize.
Figures
read the original abstract
Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS. md or CLAUDE. md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS. md or a CLAUDE. md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first catalog of six configuration smells in AGENTS.md and CLAUDE.md files for coding agents. Smells were identified via grey literature review and repository mining; automated heuristics were proposed for detection; and prevalence was measured by applying the heuristics to 100 popular open-source repositories, yielding rates such as 62% for Lint Leakage, 42% for Context Bloat, and 35% for Skill Leakage, with noted co-occurrences.
Significance. If the detection heuristics prove reliable, the work supplies a timely, practical catalog that could guide configuration practices and support future linting tools for agent instructions. The empirical scale (100 repositories) and focus on an emerging SE artifact add relevance, but the absence of any reported validation for the heuristics directly limits the trustworthiness of the prevalence numbers that form the central empirical result.
major comments (1)
- [Methods] Methods section (and abstract): the prevalence claims rest entirely on the automated heuristics, yet the manuscript provides no validation step—no manual labeling of a sample, no precision/recall figures, no inter-rater agreement, and no error analysis for any of the six smells. This is load-bearing because the reported percentages (e.g., Lint Leakage at 62%) cannot be interpreted without evidence that the heuristics are accurate rather than inflated by false positives.
Simulated Author's Rebuttal
We thank the referee for highlighting this critical methodological gap. We agree that the prevalence results cannot be fully trusted without evidence of heuristic accuracy and will revise the paper accordingly.
read point-by-point responses
-
Referee: [Methods] Methods section (and abstract): the prevalence claims rest entirely on the automated heuristics, yet the manuscript provides no validation step—no manual labeling of a sample, no precision/recall figures, no inter-rater agreement, and no error analysis for any of the six smells. This is load-bearing because the reported percentages (e.g., Lint Leakage at 62%) cannot be interpreted without evidence that the heuristics are accurate rather than inflated by false positives.
Authors: We fully agree. The heuristics were constructed from the grey-literature review and initial repository mining but were applied to the 100-repository corpus without any subsequent manual validation or error analysis. In the revised manuscript we will (1) add a dedicated validation subsection describing a stratified random sample of 30 files per smell (or the entire set if smaller), (2) report independent labeling by two authors, (3) compute precision, recall, and Cohen’s kappa, (4) include a concise error analysis for each smell, and (5) update the abstract and methods to reflect these new results. We will also add an explicit threats-to-validity paragraph acknowledging that the original prevalence figures were heuristic-based only. revision: yes
Circularity Check
Empirical prevalence study with no circular derivations
full rationale
The paper identifies smells via grey-literature review plus repository mining, defines heuristics from those steps, then measures prevalence by applying the heuristics to 100 separate repositories. No equations, no fitted parameters renamed as predictions, no self-citation chains, and no self-definitional loops appear in the derivation. Prevalence percentages are direct counts from external data under the stated heuristics; the chain is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Grey literature review combined with repository mining can identify the relevant configuration smells for coding agents
Reference graph
Works this paper leans on
-
[1]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Prompt Engineering or Fine Tuning: An Empirical As- sessment of Large Language Models in Automated Software Engineering Tasks,
J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, and H. Hem- mati, “Prompt Engineering or Fine Tuning: An Empirical As- sessment of Large Language Models in Automated Software Engineering Tasks,”ArXiv, 2023
2023
-
[3]
Using Transfer Learning for Code-Related Tasks,
A. Mastropaolo, N. Cooper, D. N. Palacio, S. Scalabrino, D. Poshyvanyk, R. Oliveto, and G. Bavota, “Using Transfer Learning for Code-Related Tasks,”IEEE Transactions on Soft- ware Engineering, 2023
2023
-
[4]
Using Large Language Models to Generate JUnit Tests: An Empirical Study,
M. L. Siddiq, J. C. S. Santos, R. H. Tanvir, N. Ulfat, F. A. Rifat, and V . C. Lopes, “Using Large Language Models to Generate JUnit Tests: An Empirical Study,” in28th International Con- ference on Evaluation and Assessment in Software Engineering (EASE), 2024
2024
-
[5]
Automated Unit Test Improvement Using Large Language Models at Meta,
N. Alshahwan, J. Chheda, A. Finegenova, B. Gokkaya, M. Har- man, I. Harper, A. Marginean, S. Sengupta, and E. Wang, “Automated Unit Test Improvement Using Large Language Models at Meta,” in32nd ACM Symposium on the Foundations of Software Engineering (FSE), 2024
2024
-
[6]
LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning,
J. Lu, L. Yu, X. Li, L. Yang, and C. Zuo, “LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning,” inIEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023
2023
-
[7]
On the Quality of AI-Generated Source Code Comments: A Compre- hensive Evaluation,
I. Guelman, A. G. Leal, L. Xavier, and M. T. Valente, “On the Quality of AI-Generated Source Code Comments: A Compre- hensive Evaluation,” in1st International Workshop on AI for Software Quality Evaluation (AI-SQE), 2026
2026
-
[8]
Large Language Models for Software Engineering: A Systematic Literature Review,
X. Hou, Y . Zhao, Y . Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large Language Models for Software Engineering: A Systematic Literature Review,”ACM Transactions on Software Engineering and Methodology, 2024
2024
-
[9]
Using Copilot Agent Mode to Automate Library Migration: A Quantitative Assess- ment,
A. Almeida, L. Xavier, and M. T. Valente, “Using Copilot Agent Mode to Automate Library Migration: A Quantitative Assess- ment,” in1st International Workshop on Agentic Engineering, 2026
2026
-
[10]
Migrating Code At Scale With LLMs At Google,
C. Ziftci, S. Nikolov, A. Sj ¨ovall, B. Kim, D. Codecasa, and M. Kim, “Migrating Code At Scale With LLMs At Google,” in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025
2025
-
[11]
Detecting Code Smells Using ChatGPT: Initial Insights,
L. L. Silva, J. R. d. Silva, J. E. Montandon, M. Andrade, and M. T. Valente, “Detecting Code Smells Using ChatGPT: Initial Insights,” in18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2024
2024
-
[12]
The rise and potential of large language model based agents: a survey,
Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y . Zhou, W. Wang, C. Jiang, Y . Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Qin, Y . Zheng, X. Qiu, X. Huang, Q. Zhang, and T. Gui, “The rise and potential of large language model based agents: a survey,”Science China Information Sciences, 2025
2025
-
[13]
Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects,
H. V . F. Santos, V . Costa, J. E. Montandon, and M. T. Valente, “Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects,” in1st International Workshop on Agentic Engineering, 2026
2026
-
[14]
Context engineering for ai agents in open-source software,
S. Mohsenimofidi, M. Galster, C. Treude, and S. Baltes, “Con- text engineering for ai agents in open-source software,”arXiv preprint arXiv:2510.21413, 2025
-
[15]
Guidelines for including grey literature and conducting multivocal literature reviews in software engineering,
V . Garousi, M. Felderer, and M. V . M ¨antyl¨a, “Guidelines for including grey literature and conducting multivocal literature reviews in software engineering,”Information and Software Technology, 2019
2019
-
[16]
Code Smells in Elixir: Early Results from a Grey Literature Review,
L. Vegi and M. T. Valente, “Code Smells in Elixir: Early Results from a Grey Literature Review,” in30th International Conference on Program Comprehension (ICPC), 2022
2022
-
[17]
How developers search for code: a case study,
C. Sadowski, K. T. Stolee, and S. Elbaum, “How developers search for code: a case study,” inProceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015
2015
-
[18]
Huyen,AI Engineering: Building Applications with Founda- tion Models
C. Huyen,AI Engineering: Building Applications with Founda- tion Models. O’Reilly, 2025
2025
-
[19]
Fast discovery of association rules,
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,”Advances in Knowledge Discovery and Data Mining, 1996
1996
-
[20]
An Empirical Study on Code Smells Co-occurrences in Android Applications,
O. Hamdi, A. Ouni, E. A. AlOmar, and M. W. Mkaouer, “An Empirical Study on Code Smells Co-occurrences in Android Applications,” in36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW), 2021
2021
-
[21]
On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems,
B. A. Muse, M. M. Rahman, C. Nagy, A. Cleve, F. Khomh, and G. Antoniol, “On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems,” in17th International Conference on Mining Software Repositories (MSR), 2020
2020
-
[22]
Investigating Code Smell Co-Occurrences Using Association Rule Learning: A Replicated Study,
F. Palomba, R. Oliveto, and A. De Lucia, “Investigating Code Smell Co-Occurrences Using Association Rule Learning: A Replicated Study,” inIEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), 2017
2017
-
[23]
Using a Probabilistic Model to Predict Bug Fixes,
M. Soto and C. Le Goues, “Using a Probabilistic Model to Predict Bug Fixes,” inIEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018
2018
-
[24]
Bug Analysis in Jupyter Notebook Projects: An Empirical Study,
T. L. De Santana, P. A. D. M. S. Neto, E. S. De Almeida, and I. Ahmed, “Bug Analysis in Jupyter Notebook Projects: An Empirical Study,”ACM Transactions on Software Engineering and Methodology (TOSEM), 2024
2024
-
[25]
Using an LLM to Help With Code Understanding,
D. Nam, A. Macvean, V . J. Hellendoorn, B. Vasilescu, and B. A. Myers, “Using an LLM to Help With Code Understanding,” in 46th International Conference on Software Engineering (ICSE), 2024
2024
-
[26]
Lessons from Building CodeBuddy: A Contextualized AI Coding Assistant,
G. Pinto, C. R. B. de Souza, J. B. Cordeiro Neto, A. de Souza, T. Gotto, and E. Monteiro, “Lessons from Building CodeBuddy: A Contextualized AI Coding Assistant,” in46th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2024
2024
-
[27]
PromptSet: A Programmer’s Prompting Dataset,
K. Pister, D. J. Paul, P. Brophy, and I. Joshi, “PromptSet: A Programmer’s Prompting Dataset,” inProceedings of the 1st ACM International Conference on AI-Powered Software (AIware), 2024
2024
-
[28]
Landscape and Taxonomy of Prompt Engineer- ing Patterns in Software Engineering,
Y . Sasaki, H. Washizaki, J. Li, N. Yoshioka, N. Ubayashi, and Y . Fukazawa, “Landscape and Taxonomy of Prompt Engineer- ing Patterns in Software Engineering,”IT Professional, 2025
2025
-
[29]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” in12th International Conference on Learning Representations (ICLR), 2024
2024
-
[30]
On the use of agentic coding manifests: An empirical study of claude code,
W. Chatlatanagulchai, K. Thonglek, B. Reid, Y . Kashiwa, P. Leelaprute, A. Rungsawang, B. Manaskasemsak, and H. Iida, “On the use of agentic coding manifests: An empirical study of claude code,” inInternational Conference on Product-Focused Software Process Improvement, 2025
2025
-
[31]
When and Why Your Code Starts to Smell Bad,
M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, and D. Poshyvanyk, “When and Why Your Code Starts to Smell Bad,” in37th IEEE/ACM International Conference on Software Engineering (ICSE), 2015
2015
-
[32]
Do They Really Smell Bad? A Study on Developers’ Perception of Bad Code Smells,
F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, and A. De Lu- cia, “Do They Really Smell Bad? A Study on Developers’ Perception of Bad Code Smells,” inInternational Conference on Software Maintenance and Evolution (ICSME), 2014
2014
-
[33]
A Systematic Literature Review on Bad Smells–5 W’s: Which, When, What, Who, Where,
E. V . de Paulo Sobrinho, A. De Lucia, and M. de Almeida Maia, “A Systematic Literature Review on Bad Smells–5 W’s: Which, When, What, Who, Where,”IEEE Transactions on Software Engineering, 2021
2021
-
[34]
On the Definition of Microservice Bad Smells,
D. Taibi and V . Lenarduzzi, “On the Definition of Microservice Bad Smells,”IEEE Software, 2018
2018
-
[35]
Fixing Dockerfile Smells: An Empirical Study,
G. Rosa, F. Zappone, S. Scalabrino, and R. Oliveto, “Fixing Dockerfile Smells: An Empirical Study,”Empirical Software Engineering, 2024
2024
-
[36]
F. Urdih, T. Theodoropoulos, and U. Zdun, “Cache-Related Smells in GitLab CI/CD: Comprehensive Catalog, Auto- mated Detection, and Empirical Evidence,”arXiv preprint arXiv:2604.17890, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.