
arxiv: 2606.13298 · v1 · pith:FAZONVLLnew · submitted 2026-06-11 · 💻 cs.SE · cs.AI

Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories

Pith reviewed 2026-06-27 06:03 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords agentic AI · architectural smells · causal inference · difference-in-differences · Java repositories · smell density · software architecture · propensity matching

The pith

Agentic AI adoption increases code volume without increasing architectural smell counts, causing a density decline as a denominator effect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the causal impact of adopting agentic AI coding tools on the architectural quality of Java software projects. Using a staggered difference-in-differences approach on 151 repositories, it shows that total architectural smells remain essentially unchanged while lines of code increase substantially. This results in a reduction in architectural smell density that stems from the larger codebase size rather than any improvement in architecture. The findings indicate that relying on density metrics alone can be misleading in studies of AI tool effects when those tools also influence system scale.

Core claim

Adoption of agentic AI produces no statistically significant change in total architectural smell counts (+1.1%, p=0.82) but drives a 12.8% increase in lines of code (p=0.003), yielding a 6.7% decline in smell density (p=0.004) that is attributable to the denominator rather than fewer smells per unit of code.
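The denominator effect follows from how a density is built. The sketch below is only an illustration, assuming density is the smell count divided by lines of code; the repository figures are hypothetical and not taken from the paper's data. On the log scale, the change in density equals the change in smell count minus the change in code size, so a flat numerator with a growing denominator pushes density down.

```python
import math

def log_change(before: float, after: float) -> float:
    """Change on the log scale; close to the percentage change when small."""
    return math.log(after / before)

# Hypothetical repository (illustrative numbers, not the paper's data).
smells_before, smells_after = 200, 202        # smell count barely moves
loc_before, loc_after = 100_000, 112_000      # code base grows

d_smells = log_change(smells_before, smells_after)
d_loc = log_change(loc_before, loc_after)
d_density = log_change(smells_before / loc_before, smells_after / loc_after)

# Identity: log(S / L) = log(S) - log(L), so the changes subtract exactly.
assert math.isclose(d_density, d_smells - d_loc)

print(f"smells:  {d_smells:+.3f} log points")
print(f"LOC:     {d_loc:+.3f} log points")
print(f"density: {d_density:+.3f} log points")
```

The identity is exact for a single repository. The paper's reported percentages come from separately estimated models, so they may not add up in the same exact way.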

What carries the argument

Staggered difference-in-differences estimator with Borusyak imputation applied to monthly architectural smell density measurements from Arcan, after propensity score matching of treated and control repositories.
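The core idea of an imputation estimator can be sketched in a few lines. This is a simplified illustration and not the authors' code: it fits unit and period fixed effects on untreated observations only, imputes the untreated counterfactual for treated observations, and averages the differences. It omits covariates, weighting, and inference. The function name `imputation_att` and the synthetic panel are assumptions made for the example.

```python
import numpy as np

def imputation_att(y, unit, period, treated):
    """Average effect on treated observations via fixed-effect imputation.

    All arguments are 1-D arrays of equal length; `treated` is True for
    observations at or after the unit's adoption period.
    """
    y = np.asarray(y, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    _, u_idx = np.unique(unit, return_inverse=True)
    _, p_idx = np.unique(period, return_inverse=True)
    n_units, n_periods = u_idx.max() + 1, p_idx.max() + 1

    # One dummy per unit, one per period (first period dropped as baseline).
    X = np.zeros((len(y), n_units + n_periods - 1))
    rows = np.arange(len(y))
    X[rows, u_idx] = 1.0
    later = p_idx > 0
    X[rows[later], n_units + p_idx[later] - 1] = 1.0

    # Step 1: estimate fixed effects from untreated observations only.
    coef, *_ = np.linalg.lstsq(X[~treated], y[~treated], rcond=None)
    # Step 2: impute the untreated outcome for treated observations.
    y0_hat = X[treated] @ coef
    # Step 3: average the observation-level effects.
    return float(np.mean(y[treated] - y0_hat))

# Synthetic, noise-free panel with staggered adoption (hypothetical data).
rng = np.random.default_rng(0)
n_units, n_periods, true_effect = 6, 8, -0.07
adoption = {0: 3, 1: 4, 2: 5}                 # units 3 to 5 never adopt
pairs = [(u, t) for u in range(n_units) for t in range(n_periods)]
unit = np.array([u for u, _ in pairs])
period = np.array([t for _, t in pairs])
treated = np.array([t >= adoption.get(u, n_periods) for u, t in pairs])
y = rng.normal(size=n_units)[unit] + rng.normal(size=n_periods)[period] + true_effect * treated

att = imputation_att(y, unit, period, treated)
print(round(att, 4))  # recovers the simulated effect because no noise was added
```

Because the fixed effects are learned only from untreated observations, already-treated units never serve as controls for later adopters, which is the usual motivation for this family of estimators in staggered designs.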

If this is right

  • Density-based architectural metrics require explicit decomposition into numerator and denominator effects when treatments alter code volume.
  • Raw smell counts and size metrics should be reported separately in causal evaluations of AI coding tools.
  • The pattern holds across per-type smell estimates and multiple robustness checks including wild cluster bootstrap and Lee bounds.
  • Pre-trends are consistent with the parallel trends assumption in the matched sample.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar effects may appear in other programming languages or with different AI tools if they encourage code expansion.
  • Future studies could track individual developer productivity alongside architecture to see if the size growth reflects added features or duplication.
  • The result implies that AI-assisted development might scale systems faster than it introduces architectural debt in the short term.

Load-bearing premise

Repositories are correctly classified as having adopted agentic AI based on configuration files and commit trailers, and propensity matching sufficiently balances unobserved confounders for the causal estimator.
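A rough sketch of what such a classifier could look like follows. It is not the paper's detection procedure: the marker directories, the agent-name pattern, the commit tuple layout, and the function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical markers; the paper's actual detection rules may differ.
AGENT_CONFIG_DIRS = {".cursor", ".windsurf"}
AGENT_NAME = re.compile(r"claude|cursor|copilot|windsurf", re.IGNORECASE)
TRAILER = re.compile(r"^co-authored-by:\s*(?P<name>[^<\n]+)",
                     re.IGNORECASE | re.MULTILINE)

def has_agent_config(paths):
    """True if any changed path lives under a known agent config directory."""
    return any(p.split("/", 1)[0] in AGENT_CONFIG_DIRS for p in paths)

def has_agent_trailer(message):
    """True if a Co-Authored-By trailer names something that looks like an agent."""
    return any(AGENT_NAME.search(m.group("name")) for m in TRAILER.finditer(message))

def first_adoption_month(commits):
    """Earliest 'YYYY-MM' with either signal, or None if no signal is found.

    `commits` is an iterable of (iso_date, message, changed_paths) tuples.
    """
    months = [date[:7] for date, message, paths in commits
              if has_agent_config(paths) or has_agent_trailer(message)]
    return min(months) if months else None

# Illustrative commit history (made up for this example).
commits = [
    ("2025-01-14", "Fix NPE in parser", ["src/main/java/Parser.java"]),
    ("2025-03-02",
     "Add retry logic\n\nCo-Authored-By: Claude <noreply@anthropic.com>",
     ["src/main/java/Client.java"]),
    ("2025-04-20", "Add editor rules", [".cursor/rules/style.mdc"]),
]
print(first_adoption_month(commits))  # 2025-03
```

A name-based pattern like this one would also match a human co-author who happens to be called Claude, and a config directory can exist without active use. Both are false positives of the sort that make the premise load-bearing.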

What would settle it

Finding a treated repository where lines of code do not increase after adoption but total smells rise significantly would contradict the denominator-effect explanation.

Figures

Figures reproduced from arXiv: 2606.13298 by Mahyar T. Moghaddam, Oliver Aleksander Larsen.

Figure 1: Study pipeline with sample sizes at each stage. [§3 Mining Design and Statistical Model, §3.1 Study Design] We employ a staggered difference-in-differences (DiD) design to estimate the causal effect of observable agentic AI adoption on architectural quality. Figure 1 summarizes the four phases (mining, matching, extraction, analysis) whose sample sizes we trace below. Treatment is the first detectable adoption…
Figure 2: Per-type treatment effects (Borusyak imputation) with 95% CIs. Filled markers: significant after Holm correction. Raw UD counts show a suggestive increase of 5.8% (raw p = 0.032; Holm-adjusted across four raw-count tests, p = 0.128). The density null masks this marginal signal, discussed in Sec. 5.1. Density and count tests answer distinct mechanistic questions, with family-wise error controlled within eac…
Figure 3: Event study: ASD effect by relative month. Pre-period coefficients near zero (Wald p = 0.896); post-adoption effect grows to ∼−9.5% at h = 6. Sun & Abraham overlay (grey dashed) confirms concordance. Per-type raw counts confirm the pattern: CD, HL, and GC show no significant absolute changes (all p > 0.25). UD shows a suggestive increase (+5.8%, raw p = 0.032; Holm-adjusted p = 0.128), the strongest per-ty…
Figure 4: Decomposition: separate DiD on raw smell counts (top) and LOC (bottom). Smells unchanged (+1.1%, p = 0.82); LOC grows significantly (+12.8%, p = 0.003). …(−6.3%/−7.2%); balanced ASD (−4.6%, p < 0.001) and a multi-event restriction excluding repositories with a single detection event (≥ 2 events, −6.8%, p = 0.020) retain significance, indicating the result is not driven by one-off configuration commits. Con…
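The Holm adjustment quoted in the Figure 2 and Figure 3 captions can be checked with a short sketch. Only the UD raw p-value (0.032) comes from the captions; the other three values are hypothetical placeholders chosen to respect the stated "all p > 0.25", and `holm_adjust` is a hand-written helper rather than any library's API.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# UD is taken from the captions; CD, HL, GC are hypothetical values above 0.25.
raw = {"UD": 0.032, "CD": 0.30, "HL": 0.40, "GC": 0.50}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
print(f"UD: {adjusted['UD']:.3f}")  # UD: 0.128
```

With four tests and UD as the smallest p-value, the adjusted value is 4 × 0.032 = 0.128, matching the figure quoted in the captions regardless of the exact values of the other three.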
Original abstract

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a causal analysis of agentic AI adoption on architectural smell density in 151 Java repositories using staggered difference-in-differences and the Borusyak imputation estimator. Treated units (74 repositories) are identified via configuration files and Co-Authored-By trailers, propensity-matched to controls. The main result is that smell counts are stable (+1.1%, p=0.82) while LOC increases (+12.8%, p=0.003), yielding a 6.7% ASD reduction (p=0.004) attributed to the denominator effect. Multiple robustness checks and a public replication package are provided.

Significance. Should the identification assumptions hold, the findings highlight a methodological point for causal mining studies: density metrics can produce misleading conclusions when the intervention affects system size. The work extends prior code-level analyses to architecture and supplies reproducible data, which strengthens its contribution to empirical software engineering.

major comments (1)
  1. [§3 (Identification Strategy)] The detection of agentic AI adoption via configuration files and Co-Authored-By commit trailers is central to assigning treatment status for the staggered DiD. The manuscript provides no validation, sensitivity analysis to alternative proxies, or discussion of potential false positives (e.g., config files present without active use). This is load-bearing for the claim that the observed patterns are caused by adoption rather than selection or measurement error, as differential misclassification would bias the Borusyak estimates and the parallel trends test.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the identification strategy. The concern about validation of treatment assignment is well-taken and directly relevant to the credibility of the staggered DiD estimates. We respond point-by-point below and commit to revisions that add the requested validation and sensitivity checks.

Point-by-point responses
  1. Referee: [§3 (Identification Strategy)] The detection of agentic AI adoption via configuration files and Co-Authored-By commit trailers is central to assigning treatment status for the staggered DiD. The manuscript provides no validation, sensitivity analysis to alternative proxies, or discussion of potential false positives (e.g., config files present without active use). This is load-bearing for the claim that the observed patterns are caused by adoption rather than selection or measurement error, as differential misclassification would bias the Borusyak estimates and the parallel trends test.

    Authors: We agree that explicit validation of the treatment proxy is necessary given its central role. The current manuscript relies on two observable signals (presence of agentic configuration files such as .cursor or .windsurf and Co-Authored-By trailers) that have been used in prior mining studies of AI tool adoption, but we did not report sample-level validation or alternative definitions. In the revision we will add: (1) a new subsection in §3 reporting manual inspection of a random sample of 20 treated repositories to confirm active use (via commit messages, PR descriptions, and configuration contents); (2) sensitivity analyses re-estimating the main models under stricter proxies (e.g., requiring both signals or at least three Co-Authored-By trailers) and under a looser proxy (config file only); and (3) explicit discussion of the direction and magnitude of potential misclassification bias, including how it would affect the Borusyak imputation estimator and the parallel-trends test. These additions will be supported by updated tables in the robustness section. We believe this directly mitigates the referee's concern without altering the core findings.

    revision: yes
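To give a feel for the misclassification concern, the simulation below shows the simplest case: false-positive treatment labels that are unrelated to the outcome pull a difference in means toward zero. It is a cross-sectional toy, not the paper's DiD model, and every number in it (sample size, effect, error rate, noise level) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
true_effect = -0.07          # hypothetical effect on a log outcome
false_positive_rate = 0.20   # hypothetical share of non-adopters labelled as adopters

# Ground truth and outcome.
truly_treated = rng.random(n) < 0.5
outcome = true_effect * truly_treated + rng.normal(scale=0.1, size=n)

# Observed label: every true adopter, plus some non-adopters flagged by mistake
# (for example, a config file that exists but is never used).
false_positive = ~truly_treated & (rng.random(n) < false_positive_rate)
observed_treated = truly_treated | false_positive

def diff_in_means(y, d):
    """Mean outcome of labelled-treated minus mean outcome of labelled-control."""
    return float(y[d].mean() - y[~d].mean())

clean = diff_in_means(outcome, truly_treated)
noisy = diff_in_means(outcome, observed_treated)

print(f"with true labels:     {clean:+.4f}")
print(f"with observed labels: {noisy:+.4f}")
```

With these rates roughly five in six labelled adopters are real, so the estimate shrinks by about that factor. Misclassification that correlates with the outcome (the differential case the referee raises) can bias the estimate in either direction, which is why the promised sensitivity analyses matter.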

Circularity Check

0 steps flagged

Empirical causal study with standard DiD estimator; no load-bearing circularity in derivation

full rationale

The paper applies a staggered difference-in-differences design with the Borusyak imputation estimator to a mined panel of 151 Java repositories. Treatment classification relies on observable proxies (configuration files and Co-Authored-By trailers), and outcomes (smell counts, LOC, ASD) are computed directly from Arcan snapshots. The central claim—that ASD decline is a denominator effect—is an arithmetic decomposition of the estimated treatment effects on raw counts versus size, not a self-referential equation or fitted parameter renamed as prediction. The mention of a 'causal design recently used for code-level metrics' is a methodological reference rather than a load-bearing self-citation that justifies the result. No self-definitional loops, ansatz smuggling, or uniqueness theorems appear. The analysis is self-contained against external benchmarks (public replication package) and receives a minimal score only for possible minor self-citation of the method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the adoption detection method and the parallel trends assumption of the difference-in-differences design. No free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: The parallel trends assumption holds between treated and control repositories in the absence of treatment.
    Invoked by the staggered DiD design; supported by the reported flat pre-trends (Wald p = 0.90).



Reference graph

Works this paper leans on

41 extracted references · 18 canonical work pages · 1 internal anchor

  1. Agarwal, S., He, H., Vasilescu, B.: AI IDEs or autonomous agents? Measuring the impact of coding agents on software development (2026), accepted at MSR 2026

  2. Aigner, D.J.: Regression with a binary independent variable subject to errors of observation. Journal of Econometrics 1(1), 49–59 (1973). https://doi.org/10.1016/0304-4076(73)90005-5

  3. Amasanti, G., Jahić, J.: The impact of AI-generated solutions on software architecture and productivity: Results from a survey study. In: Software Architecture: ECSA 2025 Tracks and Workshops. Lecture Notes in Computer Science, vol. 15982, pp. 89–104. Springer (2025). https://doi.org/10.1007/978-3-032-04403-7_10

  4. Anthropic: Claude Code: An agentic coding tool. Software tool (2025)

  5. Anysphere: Cursor: The AI code editor. Software tool (2024)

  6. Arcelli Fontana, F., Pigazzini, I., Roveda, R., Tamburri, D.A., Zanoni, M., Di Nitto, E.: Arcan: A Tool for Architectural Smells Detection. In: Proceedings of the IEEE International Conference on Software Architecture Workshops (ICSAW). pp. 282–285 (2017). https://doi.org/10.1109/ICSAW.2017.16

  7. Austin, P.C.: Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics 10(2), 150–161 (2011). https://doi.org/10.1002/pst.433

  8. Bass, L., Clements, P., Kazman, R.: Software Architecture in Practice. Addison-Wesley, 4th edn. (2021)

  9. Borusyak, K., Jaravel, X., Spiess, J.: Revisiting event-study designs: Robust and efficient estimation. Review of Economic Studies 91(6), 3253–3285 (2024). https://doi.org/10.1093/restud/rdae007

  10. Cameron, A.C., Gelbach, J.B., Miller, D.L.: Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics 90(3), 414–427 (2008). https://doi.org/10.1162/rest.90.3.414

  11. Cotroneo, D., Improta, C., Liguori, P.: Human-written vs. AI-generated code: A large-scale study of defects, vulnerabilities, and complexity. In: Proceedings of ISSRE. pp. 252–263 (2025). https://doi.org/10.1109/ISSRE66568.2025.00035

  12. de Chaisemartin, C., d’Haultfœuille, X.: Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review 110(9), 2964–2996 (2020). https://doi.org/10.1257/aer.20181169

  13. Esposito, M., Robredo, M., Arcelli Fontana, F., Lenarduzzi, V.: On the correlation between architectural smells and static analysis warnings. Software Quality Journal 33(4), 33 (2025). https://doi.org/10.1007/s11219-025-09730-7

  14. Fawzy, A., Tahir, A., Blincoe, K.: Vibe coding in practice: Motivations, challenges, and a future outlook—a grey literature review. In: Proceedings of ICSE-SEIP (2026), arXiv:2510.00328, to appear

  15. Garcia, J., Popescu, D., Edwards, G., Medvidovic, N.: Toward a catalogue of architectural bad smells. In: Proceedings of the International Conference on the Quality of Software Architectures (QoSA). pp. 146–162 (2009). https://doi.org/10.1007/978-3-642-02351-4_10

  16. GitHub: GitHub Copilot: Your AI pair programmer. Software tool (2025)

  17. Gnoyke, P., Schulze, S., Krüger, J.: Evolution patterns of software-architecture smells: An empirical study of intra- and inter-version smells. Journal of Systems and Software 217, 112170 (2024). https://doi.org/10.1016/j.jss.2024.112170

  18. Goodman-Bacon, A.: Difference-in-differences with variation in treatment timing. Journal of Econometrics 225(2), 254–277 (2021). https://doi.org/10.1016/j.jeconom.2021.03.014

  19. van Gurp, J., Bosch, J.: Design erosion: Problems and causes. Journal of Systems and Software 61(2), 105–119 (2002)

  20. He, H., Miller, C., Agarwal, S., Kästner, C., Vasilescu, B.: Speed at the cost of quality: How Cursor AI increases short-term velocity and long-term complexity in open-source projects (2025), accepted at MSR 2026

  21. Hochstein, L., Lindvall, M.: Combating architectural degeneration: A survey. Information and Software Technology 47(10), 643–656 (2005)

  22. Jiang, S., Nam, D.: Beyond the prompt: An empirical study of Cursor Rules. In: Proceedings of the 23rd International Conference on Mining Software Repositories (MSR) (2026), arXiv:2512.18925, to appear

  23. Jolak, R., Karlsson, S., Dobslaw, F.: An empirical investigation of the impact of architectural smells on software maintainability. Journal of Systems and Software 225, 112382 (2025). https://doi.org/10.1016/j.jss.2025.112382

  24. Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., German, D.M., Damian, D.: An in-depth study of the promises and perils of mining GitHub. Empirical Software Engineering 21(5), 2035–2071 (2016). https://doi.org/10.1007/s10664-015-9393-5

  25. Karpathy, A.: Vibe coding. X (formerly Twitter) (February 2025), https://x.com/karpathy/status/1886192184808149383

  26. Lee, D.S.: Training, wages, and sample selection: Estimating sharp bounds on treatment effects. Review of Economic Studies 76(3), 1071–1102 (2009). https://doi.org/10.1111/j.1467-937X.2009.00536.x

  27. Lippert, M., Roock, S.: Refactoring in Large Software Projects: Performing Complex Restructurings Successfully. John Wiley & Sons (2006)

  28. Martin, R.C.: Agile Software Development: Principles, Patterns, and Practices. Prentice Hall (2003)

  29. Perry, D.E., Wolf, A.L.: Foundations for the study of software architecture. ACM SIGSOFT Software Engineering Notes 17(4), 40–52 (1992)

  30. Rosenbaum, P.R., Rubin, D.B.: Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39(1), 33–38 (1985). https://doi.org/10.1080/00031305.1985.10479383

  31. Rubin, D.B.: Bias reduction using Mahalanobis-metric matching. Biometrics 36(2), 293–298 (1980). https://doi.org/10.2307/2529981

  32. Sas, D., Avgeriou, P.: An architectural technical debt index based on machine learning and architectural smells. IEEE Transactions on Software Engineering 49(8), 4169–4195 (2023). https://doi.org/10.1109/TSE.2023.3286179

  33. Sas, D., Avgeriou, P., Uyumaz, U.: On the evolution and impact of architectural smells—an industrial case study. Empirical Software Engineering 27, 86 (2022)

  34. Schmid, L., Hey, T., Armbruster, M., Corallo, S., Fuchß, D., Keim, J., Liu, H., Koziolek, A.: Software architecture meets LLMs: A systematic literature review (2025)

  35. Stack Overflow: Stack Overflow developer survey 2024. Online survey (2024)

  36. Stuart, E.A.: Matching methods for causal inference: A review and a look forward. Statistical Science 25(1), 1–21 (2010). https://doi.org/10.1214/09-STS313

  37. Sun, L., Abraham, S.: Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics 225(2), 175–199 (2021). https://doi.org/10.1016/j.jeconom.2020.09.006

  38. Tessa, C., Bochicchio, M., Arcelli Fontana, F.: Exploring architectural smells detection through LLMs. In: Proceedings of the 19th European Conference on Software Architecture (ECSA). Lecture Notes in Computer Science, vol. 15929, pp. 90–98. Springer (2025). https://doi.org/10.1007/978-3-032-02138-0_6

  39. Waseem, M., Ahmad, A., Kemell, K.K., Rasku, J., Lahti, S., Mäkelä, K., Abrahamsson, P.: Vibe coding in practice: Flow, technical debt, and guidelines for sustainable use (2025)

  40. Webb, M.D.: Reworking wild bootstrap-based inference for clustered errors. Canadian Journal of Economics 56(3), 839–858 (2023). https://doi.org/10.1111/caje.12661

  41. Yetiştiren, B., Özsoy, I., Ayerdem, M., Tüzün, E.: Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT (2023)