pith. machine review for the scientific record.

arxiv: 2604.17890 · v1 · submitted 2026-04-20 · 💻 cs.SE

Recognition: unknown

Cache-Related Smells in GitLab CI/CD: Comprehensive Catalog, Automated Detection, and Empirical Evidence

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:36 UTC · model grok-4.3

classification 💻 cs.SE
keywords cache smells · CI/CD pipelines · GitLab CI · automated detection · pipeline performance · software smells · empirical study · configuration smells

The pith

Ten cache misconfigurations in GitLab CI/CD pipelines are common and detectable automatically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper catalogs ten specific cache-related smells in GitLab CI/CD that can slow builds or reduce reliability. It introduces CROSSER, a detector that identifies seven of these smells in pipeline configuration files. Tests on 82 labeled projects reach an overall F1 score of 0.98. An examination of 228 mature open-source projects finds the smells in 89 percent of them, with only 11 percent free of any. The work also notes that many developers overlook higher-level caching options available in the platform.
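For orientation, the F1 score is the harmonic mean of precision and recall over smell detections. A minimal illustration with hypothetical counts (not the paper's actual confusion matrix):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts consistent with an F1 of 0.98:
# 98 true positives, 2 false positives, 2 false negatives.
print(round(f1_score(tp=98, fp=2, fn=2), 2))  # 0.98
```

An F1 of 0.98 therefore means the detector both misses few real smells and raises few false alarms on the labeled corpus.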

Core claim

We present a comprehensive catalog of ten cache-related smells in GitLab CI/CD that negatively impact performance and reliability, validated on a corpus of grey literature. To address the smells, we propose CROSSER, a tool that automatically detects seven of the ten smells. We evaluate CROSSER on a manually labeled dataset of 82 mature projects, achieving an overall F1 score of 0.98. Finally, we investigate the presence of smells across a large dataset of 228 mature open-source projects and outline our empirical findings. Our results show a widespread frequency of the smells, as only 11% of the projects do not present any. We also show that developers may not be aware of higher-level caching functionalities.

What carries the argument

The catalog of ten cache-related smells, which are misconfigurations or suboptimal uses of caching in GitLab CI/CD pipeline files, together with the rule-based detector CROSSER that scans .gitlab-ci.yml files to flag seven of them.
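The page does not spell out CROSSER's rules, but a rule-based check of this kind can be sketched in a few lines. The smell below — a cache declared without an explicit `key`, so every branch shares one default cache entry — is a hypothetical example in the same spirit, run on an already-parsed .gitlab-ci.yml (shown here as a Python dict):

```python
# Hypothetical rule in the spirit of a rule-based smell detector:
# flag any job whose cache block lacks an explicit `key`, since the
# default key lumps all branches into a single shared cache entry.
RESERVED = {"stages", "variables", "default", "include", "workflow"}

def find_keyless_caches(config: dict) -> list[str]:
    smelly = []
    for name, job in config.items():
        if name in RESERVED or not isinstance(job, dict):
            continue  # skip top-level keywords that are not jobs
        cache = job.get("cache")
        if isinstance(cache, dict) and "key" not in cache:
            smelly.append(name)
    return smelly

pipeline = {  # parsed form of a minimal .gitlab-ci.yml
    "stages": ["test"],
    "test-job": {"stage": "test",
                 "cache": {"paths": [".pip-cache/"]},
                 "script": ["pytest"]},
}
print(find_keyless_caches(pipeline))  # ['test-job']
```

A real detector would parse the YAML file itself and implement one such rule per smell; the reserved-keyword list and the specific rule here are assumptions for illustration only.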

Load-bearing premise

That the ten smells extracted from grey literature genuinely and consistently degrade pipeline performance and reliability, and that the manual labeling of the 82-project dataset accurately captures smell presence without significant subjectivity.

What would settle it

A controlled measurement of pipeline run times and failure rates on the same projects before and after each of the ten smells is fixed, to check whether the claimed performance and reliability penalties actually appear.
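Such a before/after study reduces to paired measurements per project; a minimal sketch with invented durations (not data from the paper):

```python
from statistics import mean

# Hypothetical per-project pipeline durations in seconds,
# paired before and after fixing one smell in each project.
before = [310.0, 295.0, 420.0, 280.0]
after = [250.0, 240.0, 330.0, 275.0]

deltas = [b - a for b, a in zip(before, after)]
print(f"mean speedup: {mean(deltas):.1f}s")  # mean speedup: 52.5s
```

A real study would also need enough projects for a paired significance test, and a parallel measure of failure rates for the reliability claim.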

Figures

Figures reproduced from arXiv: 2604.17890 by Francesco Urdih, Theodoros Theodoropoulos, Uwe Zdun.

Figure 1
Figure 1. The methodology applied in this work. view at source ↗
Figure 2
Figure 2. The relative distribution of smelly jobs across the … view at source ↗
read the original abstract

Continuous Integration and Deployment (CI/CD) facilitate rapid software delivery, making fast feedback and minimal downtime essential. While caching has been shown to be an effective technique for tackling pipeline performance and reliability issues, existing works have primarily focused on missing dependency caches, ignoring other types of caches and cache misconfigurations. In this paper, we present a comprehensive catalog of ten cache-related smells in GitLab CI/CD that negatively impact performance and reliability, validated on a corpus of grey literature. To address the smells, we propose CROSSER, a tool that automatically detects seven of the ten smells. We evaluate CROSSER on a manually labeled dataset of 82 mature projects, achieving an overall F1 score of 0.98. Finally, we investigate the presence of smells across a large dataset of 228 mature open-source projects and outline our empirical findings. Our results show a widespread frequency of the smells, as only 11% of the projects do not present any. We also show that developers may not be aware of higher-level caching functionalities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a catalog of ten cache-related smells in GitLab CI/CD pipelines, extracted and validated from grey literature, that are asserted to negatively impact performance and reliability. It introduces the CROSSER tool to automatically detect seven of the ten smells, evaluates the tool on a manually labeled dataset of 82 mature projects (overall F1 of 0.98), and reports prevalence statistics across 228 mature open-source projects (only 11% smell-free). It additionally notes limited developer awareness of higher-level caching features.

Significance. The high detection accuracy and large-scale prevalence data would provide practitioners with actionable insights for GitLab CI/CD optimization if the catalog's impact claims hold. The use of separate labeled and large-scale datasets for evaluation strengthens the empirical component, though the absence of direct runtime or reliability measurements limits the strength of conclusions about performance degradation.

major comments (2)
  1. [Abstract and catalog description] Abstract and catalog description: the assertion that the ten smells 'negatively impact performance and reliability' rests solely on grey-literature extraction with no direct measurement (e.g., execution-time deltas, cache-hit rates, or failure-rate correlations) collected or reported for the 82- or 228-project corpora.
  2. [Evaluation section on the 82-project dataset] Evaluation section on the 82-project dataset: the manual labeling process used to create ground truth is described but provides no inter-rater agreement statistics (such as Cohen's kappa) or explicit decision rules for smell identification, which is required to substantiate the reliability of the reported F1 score of 0.98.
minor comments (1)
  1. [Empirical findings] The observation that 'developers may not be aware of higher-level caching functionalities' would be strengthened by citing specific examples or prevalence data from the analyzed projects rather than remaining at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each of the major comments below and outline the revisions we plan to make.

read point-by-point responses
  1. Referee: [Abstract and catalog description] Abstract and catalog description: the assertion that the ten smells 'negatively impact performance and reliability' rests solely on grey-literature extraction with no direct measurement (e.g., execution-time deltas, cache-hit rates, or failure-rate correlations) collected or reported for the 82- or 228-project corpora.

    Authors: The referee is correct that we do not provide direct measurements of performance or reliability impacts within the 82- or 228-project datasets. The negative impacts attributed to the smells are based on the practitioner reports and discussions extracted from the grey literature during catalog construction. Our empirical evaluation focuses on detection accuracy and prevalence rather than quantifying the impacts. In the revised version, we will update the abstract and introduction to explicitly state that the impacts are as documented in the grey literature sources. We will also add a new subsection in the discussion or threats to validity to acknowledge the absence of direct runtime measurements and to suggest this as an avenue for future research. revision: partial

  2. Referee: [Evaluation section on the 82-project dataset] Evaluation section on the 82-project dataset: the manual labeling process used to create ground truth is described but provides no inter-rater agreement statistics (such as Cohen's kappa) or explicit decision rules for smell identification, which is required to substantiate the reliability of the reported F1 score of 0.98.

    Authors: We agree that including inter-rater agreement metrics and explicit decision rules would strengthen the description of the ground truth creation process. Upon review, the labeling was performed by the authors with domain expertise in CI/CD, following a set of decision rules derived from the catalog definitions. In the revised manuscript, we will expand the evaluation section to include: (1) the explicit decision rules used for each smell, (2) details on the labeling procedure, and (3) inter-rater agreement statistics (Cohen's kappa) calculated from a subset of projects labeled independently by two authors. We will also make the labeled dataset publicly available to support reproducibility and independent assessment. revision: yes
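For reference, Cohen's kappa for two raters assigning binary smell labels can be computed directly; the labels below are invented for illustration:

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two raters assigning binary labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n  # each rater's rate of label 1
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    return (observed - expected) / (1 - expected)

# Hypothetical labels (1 = smell present) from two independent raters:
rater_a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.58
```

Kappa discounts the agreement two raters would reach by chance, which is why it is the standard complement to raw agreement percentages in labeling studies like this one.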

Circularity Check

0 steps flagged

No circularity in catalog derivation, tool evaluation, or prevalence analysis

full rationale

The paper extracts its ten-smell catalog from grey literature, implements the CROSSER detector for seven of them, reports F1=0.98 on a separately manually labeled 82-project dataset, and counts prevalence (11% clean) across 228 projects. No equations, fitted parameters, or predictions appear; the F1 is a standard detection metric on held-out labels rather than any reduction to prior fits. No self-citations are invoked to justify uniqueness or load-bearing premises, and the central claims rest on external grey-literature sources plus independent large-scale counting. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on domain assumptions about what constitutes harmful cache usage in CI/CD and on the representativeness of grey literature and the chosen project corpus.

axioms (1)
  • domain assumption Properly configured caching improves CI/CD pipeline performance and reliability
    Stated as background motivation in the abstract and used to justify why the smells matter.

pith-pipeline@v0.9.0 · 5485 in / 1194 out tokens · 42194 ms · 2026-05-10T04:36:51.830816+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 3 canonical work pages

  1. [1]

    2023. Usage of CI/CD tools in companies of more than 1K employees. https://www.developernation.net/developer-reports/dn25/. Accessed: 2026-01-10

  2. [2]

    Rabe Abdalkareem, Suhaib Mujahid, Emad Shihab, and Juergen Rilling. 2019. Which commits can be CI skipped? IEEE Transactions on Software Engineering 47, 3 (2019), 448–463

  3. [3]

    Various authors. 2025. SnakeYAML. https://bitbucket.org/snakeyaml/snakeyaml/. Accessed: 2026-01-10

  4. [4]

    Islem Bouzenia and Michael Pradel. 2024. Resource usage and optimization opportunities in workflows of GitHub Actions. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12

  5. [5]

    Ahmet Celik, Alex Knaust, Aleksandar Milicevic, and Milos Gligoric. 2016. Build system with lazy retrieval for Java projects. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 643–654

  6. [6]

    Lianping Chen. 2015. Continuous delivery: Huge benefits, but challenges too. IEEE Software 32, 2 (2015), 50–54

  7. [7]

    Thomas F Düllmann, Oliver Kabierschke, and Andre Van Hoorn. 2021. StalkCD: A model-driven framework for interoperability and analysis of CI/CD pipelines. In 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 214–223

  8. [8]

    Paul M Duvall. 2010. Continuous integration: Patterns and anti-patterns. DZone, Incorporated

  9. [9]

    Paul M Duvall, Steve Matyas, and Andrew Glover. 2007. Continuous integration: improving software quality and reducing risk. Pearson Education

  10. [10]

    Paul M Duvall and Michael Olson. 2011. Continuous delivery: Patterns and antipatterns in the software life cycle. DZone Refcard 145 (2011), 64

  11. [11]

    Hamed Esfahani, Jonas Fietz, Qi Ke, Alexei Kolomiets, Erica Lan, Erik Mavrinac, Wolfram Schulte, Newton Sanches, and Srikanth Kandula. 2016. CloudBuild: Microsoft’s distributed and caching build service. In Proceedings of the 38th International Conference on Software Engineering Companion. 11–20

  12. [12]

    Jeffrey Fairbanks, Akshharaa Tharigonda, and Nasir U Eisty. 2023. Analyzing the Effects of CI/CD on Open Source Repositories in GitHub and GitLab. In 2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA). IEEE, 176–181

  13. [13]

    Keheliya Gallaba, John Ewart, Yves Junqueira, and Shane McIntosh. 2020. Accelerating continuous integration by caching environments and inferring dependencies. IEEE Transactions on Software Engineering 48, 6 (2020), 2040–2052

  14. [14]

    Keheliya Gallaba and Shane McIntosh. 2018. Use and misuse of continuous integration features: An empirical study of projects that (mis)use Travis CI. IEEE Transactions on Software Engineering 46, 1 (2018), 33–50

  15. [15]

    Vahid Garousi, Michael Felderer, and Mika V Mäntylä. 2019. Guidelines for including grey literature and conducting multivocal literature reviews in software engineering. Information and Software Technology 106 (2019), 101–121

  16. [16]

    Taher Ahmed Ghaleb, Daniel Alencar Da Costa, and Ying Zou. 2019. An empirical study of the long duration of continuous integration builds. Empirical Software Engineering 24 (2019), 2102–2139

  17. [17]

    Taher Ahmed Ghaleb, Safwat Hassan, and Ying Zou. 2022. Studying the interplay between the durations and breakages of continuous integration builds. IEEE Transactions on Software Engineering 49, 4 (2022), 2476–2497

  18. [18]

    Michael Hilton, Nicholas Nelson, Timothy Tunnell, Darko Marinov, and Danny Dig. 2017. Trade-offs in continuous integration: assurance, security, and flexibility. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 197–207

  19. [19]

    Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig

  20. [20]

    Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 426–437

  21. [21]

    Docker Hub. 2025. Mirror the Docker Hub library. https://docs.docker.com/docker-hub/image-library/mirror/. Accessed: 2026-01-10

  22. [22]

    Anaconda Inc. 2025. Conda Documentation. https://docs.conda.io/en/latest/. Accessed: 2026-01-10

  23. [23]

    GitLab Inc. 2025. GitLab CI/CD documentation. https://docs.gitlab.com/ci/. Accessed: 2026-01-10

  24. [24]

    GitLab Inc. 2025. GitLab Lint API. https://docs.gitlab.com/api/lint/. Accessed: 2026-01-10

  25. [25]

    Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating GitHub for engineered software projects. Empirical Software Engineering 22 (2017), 3219–3253

  26. [26]

    Ansong Ni and Ming Li. 2017. Cost-effective build outcome prediction using cascaded classifiers. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 455–458

  27. [27]

    Evangelos Ntentos, Stephen John Warnett, and Uwe Zdun. 2024. Supporting architectural decision making on training strategies in reinforcement learning architectures. In 2024 IEEE 21st International Conference on Software Architecture (ICSA). IEEE, 90–100

  28. [28]

    Doriane Olewicki, Mathieu Nayrolles, and Bram Adams. 2022. Towards language-independent brown build detection. In Proceedings of the 44th International Conference on Software Engineering. 2177–2188

  29. [29]

    Moses Openja, Forough Majidi, Foutse Khomh, Bhagya Chembakottu, and Heng Li. 2022. Studying the practices of deploying machine learning projects on docker. In Proceedings of the 26th International Conference on Evaluation and Assessment in Software Engineering. 190–200

  30. [30]

    Surya Oruganti. 2025. A Developer’s Guide to Speeding Up GitHub Actions. https://web.archive.org/web/20240713003758/https://www.warpbuild.com/blog/github-actions-speeding-up. Accessed: 2026-01-10

  31. [31]

    PyPA. 2025. Pip Documentation. https://pip.pypa.io/en/stable/index.html. Accessed: 2026-01-10

  32. [32]

    Austen Rainer and Ashley Williams. 2019. Using blog-like documents to investigate software practice: Benefits, challenges, and research directions. Journal of Software: Evolution and Process 31, 11 (2019), e2197

  33. [33]

    Thomas Rausch, Waldemar Hummer, Philipp Leitner, and Stefan Schulte. 2017. An empirical analysis of build failures in the continuous integration workflows of java-based open-source software. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 345–355

  34. [34]

    Filippo Ricca, Alessandro Marchetto, and Andrea Stocco. 2025. A multi-year grey literature review on AI-assisted test automation. Information and Software Technology (2025), 107799

  35. [35]

    Arthur J Riel. 1996. Object-oriented design heuristics. Addison-Wesley Longman Publishing Co., Inc

  36. [36]

    Camil Sadiki. 2023. Learn how to speed up Gitlab CI. https://web.archive.org/web/20240910042326/https://cloud.theodo.com/en/blog/gitlab-ci-optimization. Accessed: 2026-01-10

  37. [37]

    Scipy. 2025. Scipy - Installing system-wide via a system package manager. https://scipy.org/install/. Accessed: 2026-01-10

  38. [38]

    Mojtaba Shahin, Muhammad Ali Babar, and Liming Zhu. 2017. Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices. IEEE Access 5 (2017), 3909–3943

  39. [39]

    Tushar Sharma, Marios Fragkoulis, and Diomidis Spinellis. 2016. Does your configuration code smell? In Proceedings of the 13th International Conference on Mining Software Repositories. 189–200

  40. [40]

    Daniel Ståhl and Jan Bosch. 2013. Experienced benefits of continuous integration in industry software product development: A case study. In The 12th IASTED International Conference on Software Engineering (Innsbruck, Austria, 2013). 736–743

  41. [41]

    Klaas-Jan Stol, Paul Ralph, and Brian Fitzgerald. 2016. Grounded theory in software engineering research: a critical review and guidelines. In Proceedings of the 38th International Conference on Software Engineering. 120–131

  42. [42]

    Travis CI GmbH. 2025. Travis CI User Documentation. https://docs.travis-ci.com/. Accessed: 2026-01-10

  43. [43]

    Francesco Urdih, Theodoros Theodoropoulos, and Uwe Zdun. 2025. Architectural Design Decisions and Best Practices for Fast and Efficient CI/CD Pipelines. In European Conference on Software Architecture. Springer, 297–305

  44. [44]

    Francesco Urdih, Theodoros Theodoropoulos, and Uwe Zdun. 2026. Replication Package for ’Cache-Related Smells in GitLab CI/CD: Comprehensive Catalog, Automated Detection, and Empirical Evidence’. https://doi.org/10.5281/zenodo.19130470

  45. [45]

    Bogdan Vasilescu, Yue Yu, Huaimin Wang, Premkumar Devanbu, and Vladimir Filkov. 2015. Quality and productivity outcomes relating to continuous integration in GitHub. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 805–816

  46. [46]

    Carmine Vassallo, Sebastian Proksch, Harald C Gall, and Massimiliano Di Penta

  47. [47]

    Automated reporting of anti-patterns and decay in continuous integration. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 105–115

  48. [48]

    Carmine Vassallo, Sebastian Proksch, Anna Jancso, Harald C Gall, and Massimiliano Di Penta. 2020. Configuration smells in continuous delivery pipelines: a linter and a six-month study on GitLab. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 327–337

  49. [49]

    Stephen John Warnett and Uwe Zdun. 2022. Architectural design decisions for machine learning deployment. In 2022 IEEE 19th International Conference on Software Architecture (ICSA). IEEE, 90–100

  50. [50]

    David Gray Widder, Michael Hilton, Christian Kästner, and Bogdan Vasilescu

  51. [51]

    A conceptual replication of continuous integration pain points in the context of Travis CI. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 647–658

  52. [52]

    Mingyang Yin, Yutaro Kashiwa, Keheliya Gallaba, Mahmoud Alfadel, Yasutaka Kamei, and Shane McIntosh. 2024. Developer-Applied Accelerations in Continuous Integration: A Detection Approach and Catalog of Patterns. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1655–1666

  53. [53]

    Fiorella Zampetti, Carmine Vassallo, Sebastiano Panichella, Gerardo Canfora, Harald Gall, and Massimiliano Di Penta. 2020. An empirical characterization of bad practices in continuous integration. Empirical Software Engineering 25, 2 (2020), 1095–1135

  54. [54]

    Chen Zhang, Bihuan Chen, Linlin Chen, Xin Peng, and Wenyun Zhao. 2019. A large-scale empirical study of compiler errors in continuous integration. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 176–187

  55. [55]

    Chen Zhang, Bihuan Chen, Junhao Hu, Xin Peng, and Wenyun Zhao. 2022. BuildSonic: Detecting and Repairing Performance-Related Configuration Smells for Continuous Integration Builds. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. https://doi.org/10.1145/3551349.3556923