Cache-Related Smells in GitLab CI/CD: Comprehensive Catalog, Automated Detection, and Empirical Evidence
Pith reviewed 2026-05-10 04:36 UTC · model grok-4.3
The pith
Ten cache misconfigurations in GitLab CI/CD pipelines are common, and most of them can be detected automatically.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a comprehensive catalog of ten cache-related smells in GitLab CI/CD that negatively impact performance and reliability, validated on a corpus of grey literature. To address the smells, we propose CROSSER, a tool that automatically detects seven of the ten smells. We evaluate CROSSER on a manually labeled dataset of 82 mature projects, achieving an overall F1 score of 0.98. Finally, we investigate the presence of smells across a large dataset of 228 mature open-source projects and outline our empirical findings. Our results show a widespread frequency of the smells, as only 11% of the projects do not present any. We also show that developers may not be aware of higher-level caching functionalities.
What carries the argument
The catalog of ten cache-related smells, which are misconfigurations or suboptimal uses of caching in GitLab CI/CD pipeline files, together with the rule-based detector CROSSER that scans .gitlab-ci.yml files to flag seven of them.
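To make the detection mechanism concrete, the sketch below parses a small .gitlab-ci.yml fragment with PyYAML and flags two illustrative cache smells: a cache entry with no explicit key and a cache entry with no pull/push policy. The sample config, the function name find_cache_smells, and the two rules are assumptions chosen for illustration; they are not necessarily the rules CROSSER implements.

```python
# Illustrative sketch of a rule-based cache-smell check on a .gitlab-ci.yml file.
# The two rules below (missing explicit cache key, missing pull/push policy) are
# hypothetical examples, not necessarily the rules implemented in CROSSER.
import yaml  # PyYAML

SAMPLE_CONFIG = """
build:
  script:
    - npm ci
  cache:
    paths:
      - node_modules/
"""

def find_cache_smells(config: dict) -> list[str]:
    findings = []
    for job_name, job in config.items():
        if not isinstance(job, dict):
            continue  # skip top-level keys such as 'stages' or 'variables'
        cache = job.get("cache")
        if cache is None:
            continue
        entries = cache if isinstance(cache, list) else [cache]
        for entry in entries:
            if "key" not in entry:
                findings.append(f"{job_name}: cache has no explicit key "
                                "(falls back to the shared default key)")
            if "policy" not in entry:
                findings.append(f"{job_name}: cache policy not set "
                                "(job both pulls and pushes even if read-only)")
    return findings

if __name__ == "__main__":
    parsed = yaml.safe_load(SAMPLE_CONFIG)
    for finding in find_cache_smells(parsed):
        print(finding)
```

Run against the embedded sample, this flags both rules for the build job; a real detector would, as the paper describes, apply one such rule per catalogued smell across a project's pipeline configuration.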
Load-bearing premise
That the ten smells extracted from grey literature genuinely and consistently degrade pipeline performance and reliability, and that the manual labeling of the 82-project dataset accurately captures smell presence without significant subjectivity.
What would settle it
A controlled measurement of pipeline run times and failure rates on the same projects before and after each of the ten smells is fixed, to check whether the claimed performance and reliability penalties actually appear.
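A minimal sketch of such a before/after comparison, assuming read access to a project's pipeline history through the GitLab REST API. PROJECT_ID, TOKEN, the date windows, and the helper pipeline_stats are placeholders rather than anything taken from the paper; pagination and error handling are omitted.

```python
# Sketch of a before/after measurement of pipeline duration and failure rate,
# assuming the GitLab REST API pipelines endpoints. All identifiers below are
# placeholders for illustration.
import statistics
import requests

GITLAB = "https://gitlab.com/api/v4"
PROJECT_ID = "12345"      # hypothetical project id
TOKEN = "glpat-..."       # personal access token with read_api scope
HEADERS = {"PRIVATE-TOKEN": TOKEN}

def pipeline_stats(updated_after: str, updated_before: str) -> tuple[float, float]:
    """Return (mean duration in seconds, failure rate) for one time window."""
    listing = requests.get(
        f"{GITLAB}/projects/{PROJECT_ID}/pipelines",
        headers=HEADERS,
        params={"updated_after": updated_after,
                "updated_before": updated_before,
                "per_page": 100},  # first page only; pagination omitted
    ).json()
    durations, failures = [], 0
    for p in listing:
        # The duration field is exposed on the single-pipeline endpoint.
        detail = requests.get(
            f"{GITLAB}/projects/{PROJECT_ID}/pipelines/{p['id']}",
            headers=HEADERS,
        ).json()
        if detail.get("duration") is not None:
            durations.append(detail["duration"])
        if detail.get("status") == "failed":
            failures += 1
    mean_duration = statistics.mean(durations) if durations else float("nan")
    failure_rate = failures / len(listing) if listing else float("nan")
    return mean_duration, failure_rate

if __name__ == "__main__":
    before = pipeline_stats("2025-01-01T00:00:00Z", "2025-03-01T00:00:00Z")
    after = pipeline_stats("2025-03-01T00:00:00Z", "2025-05-01T00:00:00Z")
    print("before fix:", before)
    print("after  fix:", after)
```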
Original abstract
Continuous Integration and Deployment (CI/CD) facilitate rapid software delivery, making fast feedback and minimal downtime essential. While caching has been shown to be an effective technique for tackling pipeline performance and reliability issues, existing works have primarily focused on missing dependency caches, ignoring other types of caches and cache misconfigurations. In this paper, we present a comprehensive catalog of ten cache-related smells in GitLab CI/CD that negatively impact performance and reliability, validated on a corpus of grey literature. To address the smells, we propose CROSSER, a tool that automatically detects seven of the ten smells. We evaluate CROSSER on a manually labeled dataset of 82 mature projects, achieving an overall F1 score of 0.98. Finally, we investigate the presence of smells across a large dataset of 228 mature open-source projects and outline our empirical findings. Our results show a widespread frequency of the smells, as only 11% of the projects do not present any. We also show that developers may not be aware of higher-level caching functionalities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a catalog of ten cache-related smells in GitLab CI/CD pipelines, extracted and validated from grey literature, that are asserted to negatively impact performance and reliability. It introduces the CROSSER tool to automatically detect seven of the ten smells, evaluates the tool on a manually labeled dataset of 82 mature projects (overall F1 of 0.98), and reports prevalence statistics across 228 mature open-source projects (only 11% smell-free). It additionally notes limited developer awareness of higher-level caching features.
Significance. The high detection accuracy and large-scale prevalence data would provide practitioners with actionable insights for GitLab CI/CD optimization if the catalog's impact claims hold. The use of separate labeled and large-scale datasets for evaluation strengthens the empirical component, though the absence of direct runtime or reliability measurements limits the strength of conclusions about performance degradation.
major comments (2)
- [Abstract and catalog description] The assertion that the ten smells 'negatively impact performance and reliability' rests solely on grey-literature extraction, with no direct measurement (e.g., execution-time deltas, cache-hit rates, or failure-rate correlations) collected or reported for the 82- or 228-project corpora.
- [Evaluation section on the 82-project dataset] The manual labeling process used to create the ground truth is described, but no inter-rater agreement statistics (such as Cohen's kappa) or explicit decision rules for smell identification are provided; these are needed to substantiate the reliability of the reported F1 score of 0.98.
minor comments (1)
- [Empirical findings] The observation that 'developers may not be aware of higher-level caching functionalities' would be strengthened by citing specific examples or prevalence data from the analyzed projects rather than remaining at a high level.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. We address each of the major comments below and outline the revisions we plan to make.
Point-by-point responses
- Referee: [Abstract and catalog description] The assertion that the ten smells 'negatively impact performance and reliability' rests solely on grey-literature extraction, with no direct measurement (e.g., execution-time deltas, cache-hit rates, or failure-rate correlations) collected or reported for the 82- or 228-project corpora.
  Authors: The referee is correct that we do not provide direct measurements of performance or reliability impacts within the 82- or 228-project datasets. The negative impacts attributed to the smells are based on the practitioner reports and discussions extracted from the grey literature during catalog construction; our empirical evaluation focuses on detection accuracy and prevalence rather than on quantifying the impacts. In the revised version, we will update the abstract and introduction to state explicitly that the impacts are as documented in the grey-literature sources, and we will add a subsection in the discussion or threats to validity acknowledging the absence of direct runtime measurements and suggesting it as an avenue for future research. revision: partial
- Referee: [Evaluation section on the 82-project dataset] The manual labeling process used to create the ground truth is described, but no inter-rater agreement statistics (such as Cohen's kappa) or explicit decision rules for smell identification are provided; these are needed to substantiate the reliability of the reported F1 score of 0.98.
  Authors: We agree that including inter-rater agreement metrics and explicit decision rules would strengthen the description of the ground-truth creation process. The labeling was performed by authors with domain expertise in CI/CD, following a set of decision rules derived from the catalog definitions. In the revised manuscript, we will expand the evaluation section to include (1) the explicit decision rules used for each smell, (2) details of the labeling procedure, and (3) inter-rater agreement statistics (Cohen's kappa) computed on a subset of projects labeled independently by two authors. We will also make the labeled dataset publicly available to support reproducibility and independent assessment. revision: yes
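The promised agreement statistic is straightforward to compute. The toy sketch below evaluates Cohen's kappa for two raters assigning binary smell/no-smell labels to the same projects; the label vectors are invented illustration data, not the paper's labels.

```python
# Worked sketch of Cohen's kappa for two raters on binary smell/no-smell labels.
# The label vectors below are made-up illustration data.
def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each rater's marginal label frequencies.
    expected = 0.0
    for category in set(labels_a) | set(labels_b):
        p_a = labels_a.count(category) / n
        p_b = labels_b.count(category) / n
        expected += p_a * p_b
    return (observed - expected) / (1 - expected)

rater_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohen_kappa(rater_1, rater_2), 2))  # ~0.58 on this toy data
```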
Circularity Check
No circularity in catalog derivation, tool evaluation, or prevalence analysis
Full rationale
The paper extracts its ten-smell catalog from grey literature, implements the CROSSER detector for seven of them, reports F1=0.98 on a separately manually labeled 82-project dataset, and counts prevalence (11% clean) across 228 projects. No equations, fitted parameters, or predictions appear; the F1 is a standard detection metric on held-out labels rather than any reduction to prior fits. No self-citations are invoked to justify uniqueness or load-bearing premises, and the central claims rest on external grey-literature sources plus independent large-scale counting. The derivation chain is therefore self-contained and non-circular.
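As a reminder of what that metric summarizes, the toy computation below derives precision, recall, and F1 from invented true/false positive and negative counts; the numbers are illustrative only, not the paper's evaluation data.

```python
# Toy illustration of precision, recall, and F1 for one smell, using invented
# counts rather than the paper's evaluation results.
true_positives = 49   # detector flags the smell and the manual label agrees
false_positives = 1   # flagged by the detector but not labeled
false_negatives = 1   # labeled but missed by the detector

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")  # all 0.98
```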
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Properly configured caching improves CI/CD pipeline performance and reliability.