Exploring Statistical Change Point Detection Techniques for Performance Anomaly Detection at Mozilla

Alexander Serebrenik; Dave Hunt; Diego Elias Costa; Gregory Mierzwinski; Mohamed Bilel Besbes; Philipp Leitner; Suhaib Mujahid

arxiv: 2606.18377 · v1 · pith:NMTOC5NJnew · submitted 2026-06-16 · 💻 cs.SE

Exploring Statistical Change Point Detection Techniques for Performance Anomaly Detection at Mozilla

Mohamed Bilel Besbes , Gregory Mierzwinski , Suhaib Mujahid , Philipp Leitner , Alexander Serebrenik , Dave Hunt , Diego Elias Costa This is my paper

Pith reviewed 2026-06-26 23:30 UTC · model grok-4.3

classification 💻 cs.SE

keywords change point detectionperformance anomaly detectionensemble methodssoftware performance regressiontime series analysisMozilla PerfherderF1-score

0 comments

The pith

Ensemble voting strategies raise the F1-score of performance regression detection by 11 percent over Mozilla's current T-test method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mozilla relies on a Student's T-test to flag performance regressions across hundreds of daily code changes, yet this produces 12.5 percent false-positive alert groups and misses 6.8 percent of real regressions. The paper tests 25 change-point detection methods plus 15 ensemble combinations on 174 time series that eleven performance engineers labeled by hand. Offline and hybrid methods catch more real issues than the existing approach but generate too many false alarms. Voting ensembles reduce that precision-recall trade-off while delivering an 11 percent F1-score gain. The authors confirm the results with a practitioner survey and describe how the stronger methods could be added to the live alert system.

Core claim

While offline and hybrid change-point detection methods improve recall over Mozilla's Student's T-test approach, they do so at a high cost to precision. Ensemble voting strategies alleviate this trade-off and offer more consistent performance, resulting in 11% improvement in the F1-score on the ground-truth dataset of 174 performance time series.

What carries the argument

Ensemble voting strategies that combine outputs from multiple change-point detection methods to decide when a performance time series contains a regression.

If this is right

Ensemble methods produce more reliable alerts that reduce wasted engineering effort on false positives.
The manually annotated dataset supplies a reusable benchmark for testing future change-point detectors in performance settings.
The best ensembles can be integrated into Perfherder while preserving the existing daily alert workflow.
Practitioner survey responses support that the measured F1 gains translate to reduced alert fatigue in real use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar ensemble voting could be applied to other software metrics such as reliability or security signals in the same continuous-integration pipelines.
Organizations running comparable performance-monitoring systems might see comparable gains by swapping in these ensembles without changing their data collection.
The 174-series benchmark makes it possible to test whether newer online change-point methods maintain the same 11 percent edge when run in streaming mode.

Load-bearing premise

The ground-truth labels created by the eleven Mozilla performance engineers accurately identify true performance regressions without systematic bias in the 174 time series.

What would settle it

Re-labeling the same 174 time series by a fresh group of engineers and finding that no ensemble reaches a higher F1-score than the original T-test method.

Figures

Figures reproduced from arXiv: 2606.18377 by Alexander Serebrenik, Dave Hunt, Diego Elias Costa, Gregory Mierzwinski, Mohamed Bilel Besbes, Philipp Leitner, Suhaib Mujahid.

**Figure 1.** Figure 1: Study Design Overview building blocks of our study. Problem Understanding. We investigate Mozilla’s source code and toolset, workflow documentation, and the dataset of Mozilla performance measurements [11] to develop an understanding of the Mozilla anomaly detection workflow (Section 3.1). By analyzing these resources, we consolidate Mozilla’s performance engineering process and we evaluate the scale of th… view at source ↗

**Figure 2.** Figure 2: Alert creation example 3 To what extent does Mozilla’s method produce false alerts and miss real ones? This section dives into the process adopted by Mozilla to identify performance anomalies during software development. We describe Mozilla’s performance testing workflow in Section 3.1 and dive into a preliminary analysis of the issues of the current alert system in Section 3.2. 3.1 The Mozilla’s Performan… view at source ↗

**Figure 3.** Figure 3: Simplified workflow of Mozilla’s alert creation process [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Experimental methodology overview Speedometer3 and TP6, respectively. By selecting time series associated with these subsets among the entire published dataset of 5655 time series, we obtain 2851 time series. From those, we randomly sample 400 for the annotation process to maintain a 95% confidence level and a 5% margin of error. In the performance time series dataset, some revisions are associated with mu… view at source ↗

**Figure 5.** Figure 5: Annotation user interface example Based on the feedback from a pilot study with two researchers and two practitioners, we add more detailed tutorial content and performance test specifications, and add capabilities to zoom on the y-axis to the tool. 4.3 Data Annotation Process We include the performance time series in the annotation platform without any prior alert-related labels. The goal of this process … view at source ↗

**Figure 6.** Figure 6: Hybrid CPD methods pipeline breakdown. We highlight the varying statistical tests and hyperparameter search space. Default [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: CPD agreement example: consensus threshold [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: OvR F1-score related statistics 5.2 Offline Methods’ Results [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Likert scale questions results The participant pool mirrors that of the annotation study ( [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: MOS Performance Grading (Q16 & Q17) and Deployment Preferences (Q18) [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗

read the original abstract

Software performance regressions can have significant business consequences, making automated detection a critical component of modern continuous integration pipelines. At Mozilla, performance anomaly detection is handled by Perfherder, Mozilla's performance engineering management system that relies on a Student's T-test-based approach to flag regressions across hundreds of daily code changes. However, our preliminary analysis of one year of Mozilla performance data reveals that 12.5% of generated alert groups are false positives, while approximately 6.8% of them contain regressions missed by the automated system. This paper presents an empirical study evaluating 25 change-point detection (CPD) methods and 15 ensemble approaches as alternatives to Mozilla's current method. We construct a ground-truth dataset of 174 performance time series manually annotated by eleven Mozilla performance engineers, representing one of the first practitioner-annotated CPD benchmarks for performance engineering. Our results show that while offline and hybrid CPD methods improve recall over Mozilla's method, they do so at a high cost to precision. Ensemble voting strategies alleviate this trade-off and offer more consistent performance, resulting in 11% improvement in the F1-score. We validate the experimental results through a practitioner survey and report on lessons learned from integrating the best methods into Mozilla's performance engineering system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The new 174-series practitioner-labeled dataset is the real addition here, but the 11% F1 gain from ensembles sits on unexamined annotation quality.

read the letter

The paper's main deliverable is a ground-truth set of 174 performance time series from Mozilla, labeled by eleven engineers, plus a head-to-head test of 25 change-point methods and 15 ensembles against the existing T-test baseline. That dataset is new for this subfield and the abstract reports a concrete 11% F1 lift from voting ensembles, along with the baseline's 12.5% false-positive and 6.8% miss rates.

They also ran a follow-up practitioner survey, which is a reasonable way to check whether the quantitative results line up with what the engineers actually care about. The work stays empirical and avoids circular math, which keeps the claims testable.

The soft spot is the labels themselves. Every reported improvement is measured against those annotations, yet the abstract gives no inter-annotator agreement numbers, no blinding details, and no external check such as injected regressions. If the eleven engineers share consistent biases about what counts as a regression, then any detector that matches those biases will look stronger without being better in an objective sense. The data also come from a single organization over one year, so the ensemble advantage may not travel.

This is useful reading for people who maintain performance dashboards in CI pipelines and want a real-world benchmark to compare against. The dataset alone gives it enough substance that a serious editor should send it out for review rather than desk-reject, even though the labeling process will need more scrutiny in revision.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates 25 change-point detection methods and 15 ensemble approaches as alternatives to Mozilla's T-test baseline for performance anomaly detection in Perfherder. It constructs a ground-truth dataset of 174 time series annotated by eleven performance engineers, reports that offline/hybrid methods improve recall at the cost of precision while ensembles achieve an 11% F1-score gain with more consistent performance, and validates results via a practitioner survey.

Significance. If the ground-truth annotations prove reliable, the work supplies one of the first practitioner-annotated benchmarks for CPD in performance engineering and identifies deployable ensemble strategies that could reduce the observed 12.5% false-positive and 6.8% missed-regression rates in continuous-integration pipelines.

major comments (3)

[Dataset construction] Dataset construction (implicit in abstract and § on ground-truth): the central 11% F1 improvement is computed entirely against labels produced by eleven engineers, yet no inter-annotator agreement statistic (Fleiss' kappa, percentage agreement, or disagreement-resolution protocol) is reported. Without this, it is impossible to distinguish genuine detection quality from methods that simply reproduce annotator-specific biases.
[Results] Results (abstract and empirical comparison section): the 11% F1 gain for ensembles is stated without accompanying statistical significance tests (paired Wilcoxon or McNemar across the 174 series), confidence intervals, or per-series variance, so it is unclear whether the delta is robust or dataset-specific.
[Methods] Methods (implementation of the 25 CPD algorithms and 15 ensembles): the manuscript provides no explicit description of hyper-parameter selection, handling of missing values or trend components in the Mozilla time series, or the exact voting rules used in the ensembles, rendering the comparison non-reproducible and the fairness of the baseline evaluation difficult to assess.

minor comments (2)

[Abstract] Abstract states concrete percentages (12.5% false positives, 6.8% missed) but does not indicate the exact time window or alert-group definition used to compute them.
[Validation] The practitioner survey is mentioned as validation but its sample size, questions, and response rate are not summarized in the abstract or early sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important aspects of rigor for the ground-truth dataset, statistical validation of results, and reproducibility. We address each point below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Dataset construction] Dataset construction (implicit in abstract and § on ground-truth): the central 11% F1 improvement is computed entirely against labels produced by eleven engineers, yet no inter-annotator agreement statistic (Fleiss' kappa, percentage agreement, or disagreement-resolution protocol) is reported. Without this, it is impossible to distinguish genuine detection quality from methods that simply reproduce annotator-specific biases.

Authors: We agree that reporting inter-annotator agreement is essential for establishing the reliability of the practitioner-annotated ground truth. The current manuscript does not include Fleiss' kappa, percentage agreement, or a detailed disagreement-resolution protocol. In the revision we will compute and report these statistics (including pairwise agreement rates) and describe the annotation process and any resolution steps used by the eleven engineers. revision: yes
Referee: [Results] Results (abstract and empirical comparison section): the 11% F1 gain for ensembles is stated without accompanying statistical significance tests (paired Wilcoxon or McNemar across the 174 series), confidence intervals, or per-series variance, so it is unclear whether the delta is robust or dataset-specific.

Authors: We concur that the reported 11% F1 improvement requires statistical support to demonstrate robustness. The manuscript currently presents only the aggregate F1 delta. We will add paired Wilcoxon signed-rank tests (and McNemar's test where suitable), 95% confidence intervals, and per-series variance measures across the 174 time series in the revised empirical section. revision: yes
Referee: [Methods] Methods (implementation of the 25 CPD algorithms and 15 ensembles): the manuscript provides no explicit description of hyper-parameter selection, handling of missing values or trend components in the Mozilla time series, or the exact voting rules used in the ensembles, rendering the comparison non-reproducible and the fairness of the baseline evaluation difficult to assess.

Authors: We acknowledge the need for greater methodological transparency. The current text does not detail hyper-parameter selection, missing-value handling, trend removal, or the precise ensemble voting rules. In revision we will expand the methods section with these specifics (including any grid-search or default-parameter choices, preprocessing steps, and voting implementations such as majority or weighted schemes) to enable full reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on external annotations

full rationale

The paper reports an empirical evaluation of 25 CPD methods and 15 ensembles on 174 time series whose ground-truth labels were created by eleven Mozilla engineers. The central result (11% F1 improvement of ensembles over the T-test baseline) is computed directly against these external labels. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The evaluation is therefore self-contained against the stated benchmark and does not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical comparison study; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5773 in / 1132 out tokens · 32167 ms · 2026-06-26T23:30:49.038696+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

96 extracted references · 71 canonical work pages · 2 internal anchors

[1]

Ryan Prescott Adams and David J. C. MacKay. 2007. Bayesian Online Changepoint Detection. arXiv:0710.3742 [stat.ML] https://arxiv.org/abs/0710.3 742

Pith/arXiv arXiv 2007
[2]

Mejbah Alam, Justin Gottschlich, Nesime Tatbul, Javier Turek, Timothy Mattson, and Abdullah Muzahid. 2019. A zero-positive learning approach for diagnosing software performance regressions. InAdvances in Neural Information Processing Systems, Vol. 32. Curran Associates Inc., Red Hook, NY, USA, Article 1043, 13 pages. doi:10.5555/3454287.3455330

work page doi:10.5555/3454287.3455330 2019
[3]

Samaneh Aminikhanghahi and Diane Cook. 2017. A Survey of Methods for Time Series Change Point Detection.Knowledge and Information Systems 51, 2 (05 2017), 339––367. doi:10.1007/s10115-016-0987-z

work page doi:10.1007/s10115-016-0987-z 2017
[4]

Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. InProceedings of the 33rd International Conference on Software Engineering(Waikiki, Honolulu, HI, USA)(ICSE ’11). Association for Computing Machinery, New York, NY, USA, 1––10. doi:10.1145/1985793.1985795

work page doi:10.1145/1985793.1985795 2011
[5]

Zahra Atashgahi, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. 2022. Memory-free Online Change-point Detection: A Novel Neural Network Approach. arXiv:2207.03932 [cs.LG] https://arxiv.org/abs/2207.03932

arXiv 2022
[6]

Auger and Charles E

Ivan E. Auger and Charles E. Lawrence. 1989. Algorithms for the Optimal Identification of Segment Neighborhoods.Bulletin of Mathematical Biology51, 1 (Jan. 1989), 39––54. doi:10.1007/BF02458835

work page doi:10.1007/bf02458835 1989
[7]

Sayandeep Banerjee, Bappa Basak, Sandip Mandal, Subhajit Manna, Shuvam Chakraborty, Sujata Ghatak, and Anirban Das. 2024. Real Time Anomaly Detection in Network Traffic: A Comparative Analysis of Machine Learning.International Research Journal on Advanced Engineering Hub (IRJAEH)2, 07 (Jul. 2024), 1968––1977. doi:10.47392/IRJAEH.2024.0269

work page doi:10.47392/irjaeh.2024.0269 2024
[8]

2024.WBS R Package

Rafal Baranowski and Piotr Fryzlewicz. 2024.WBS R Package. https://cran.r-project.org/web/packages/wbs/index.html

2024
[9]

Jonathan Bell and Gail Kaiser. 2014. Unit test virtualization with VMVM. InProceedings of the 36th International Conference on Software Engineering (Hyderabad, India)(ICSE 2014). Association for Computing Machinery, New York, NY, USA, 550––561. doi:10.1145/2568225.2568248

work page doi:10.1145/2568225.2568248 2014
[10]

Moritz Beller, Hongyu Li, Vivek Nair, Vijayaraghavan Murali, Imad Ahmad, Jürgen Cito, Drew Carlson, Ari Aye, and Wes Dyer. 2023. Learning to Learn to Predict Performance Regressions in Production at Meta. In2023 IEEE/ACM International Conference on Automation of Software Test (AST). 56––67. doi:10.1109/AST58925.2023.00010

work page doi:10.1109/ast58925.2023.00010 2023
[11]

Mohamed Bilel Besbes, Diego Elias Costa, Suhaib Mujahid, Gregory Mierzwinski, and Marco Castelluccio. 2025. A Dataset of Performance Measurements and Alerts from Mozilla (Data Artifact). InCompanion of the 16th ACM/SPEC International Conference on Performance Engineering (Toronto ON, Canada)(ICPE ’25). Association for Computing Machinery, New York, NY, US...

work page doi:10.1145/3680256.3721973 2025
[12]

Albert Bifet and Ricard Gavaldà. 2007. Learning from Time-Changing Data with Adaptive Windowing. InProceedings of the 2007 SIAM International Conference on Data Mining. 443–448. doi:10.1137/1.9781611972771.42

work page doi:10.1137/1.9781611972771.42 2007
[13]

Hampel , title =

Morton B. Brown and Alan B. Forsythe. 1974. Robust Tests for the Equality of Variances.J. Amer. Statist. Assoc.69, 346 (1974), 364––367. doi:10.1080/01621459.1974.10482955

work page doi:10.1080/01621459.1974.10482955 1974
[14]

Jinfu Chen and Weiyi Shang. 2017. An Exploratory Study of Performance Regression Introducing Code Changes. In2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 341––352. doi:10.1109/ICSME.2017.13

work page doi:10.1109/icsme.2017.13 2017
[15]

Jinfu Chen, Weiyi Shang, and Emad Shihab. 2022. PerfJIT: Test-Level Just-in-Time Prediction for Performance Regression Introducing Commits. IEEE Transactions on Software Engineering48, 5 (2022), 1529––1544. doi:10.1109/TSE.2020.3023955 Manuscript submitted to ACM 34 Besbes, Mierzwinski, Mujahid, Leitner, Serebrenik, Hunt, and Costa

work page doi:10.1109/tse.2020.3023955 2022
[16]

Zhuangbin Chen, Jinyang Liu, Yuxin Su, Hongyu Zhang, Xiao Ling, Yongqiang Yang, and Michael R. Lyu. 2022. Adaptive performance anomaly detection for online service systems via pattern sketching. InProceedings of the 44th International Conference on Software Engineering(Pittsburgh, Pennsylvania)(ICSE ’22). Association for Computing Machinery, New York, NY,...

work page doi:10.1145/3510003.3510085 2022
[17]

2012.How One Second Could Cost Amazon $1.6 Billion In Sales

Fast Company. 2012.How One Second Could Cost Amazon $1.6 Billion In Sales. https://www.fastcompany.com/1825005/how-one-second-could-cost- amazon-16-billion-sales

arXiv 2012
[18]

Diego Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak. 2021. What’s Wrong with My Benchmark Results? Studying Bad Practices in JMH Benchmarks.IEEE Transactions on Software Engineering47, 7 (2021), 1452––1467. doi:10.1109/TSE.2019.2925345

work page doi:10.1109/tse.2019.2925345 2021
[19]

Harald Cramér. 1928. On the composition of elementary errors.Scandinavian Actuarial Journal1928, 1 (1928), 13––74. doi:10.1080/03461238.1928.10 416862

work page doi:10.1080/03461238.1928.10 1928
[20]

David Daly. 2021. Creating a Virtuous Cycle in Performance Testing at MongoDB. InProceedings of the ACM/SPEC International Conference on Performance Engineering(Virtual Event, France)(ICPE ’21). Association for Computing Machinery, New York, NY, USA, 33––41. doi:10.1145/342792 1.3450234

work page doi:10.1145/342792 2021
[21]

David Daly, William Brown, Henrik Ingo, Jim O’Leary, and David Bradford. 2020. The Use of Change Point Detection to Identify Software Performance Regressions in a Continuous Integration System. InProceedings of the ACM/SPEC International Conference on Performance Engineering (Edmonton AB, Canada)(ICPE ’20). Association for Computing Machinery, New York, N...

work page doi:10.1145/3358960.3375791 2020
[22]

Janez Demšar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets.J. Mach. Learn. Res.7 (Dec. 2006), 1––30. doi:10.5555/1248547.12 48548

work page doi:10.5555/1248547.12 2006
[23]

2022.TCPDBench

Simon Eismann. 2022.TCPDBench. https://github.com/SimonEismann/TCPDBench/tree/master

2022
[24]

Emelie Engström, Margaret-Anne Storey, Per Runeson, Martin Höst, and Maria Teresa Baldassarre. 2020. How software engineering research aligns with design science: a review.Empirical Software Engineering25, 4 (July 2020), 2630––2660. doi:10.1007/s10664-020-09818-7

work page doi:10.1007/s10664-020-09818-7 2020
[25]

Paul Fearnhead and Guillem Rigaill. 2019. Changepoint Detection in the Presence of Outliers.J. Amer. Statist. Assoc.114, 525 (2019), 169––183. doi:10.1080/01621459.2017.1385466

work page doi:10.1080/01621459.2017.1385466 2019
[26]

Piotr Fryzlewicz. 2014. WILD BINARY SEGMENTATION FOR MULTIPLE CHANGE-POINT DETECTION.The Annals of Statistics42, 6 (2014), 2243––2281. doi:10.1214/14-AOS1245

work page doi:10.1214/14-aos1245 2014
[27]

Heitor Murilo Gomes, Jacob Montiel, Saulo Martiello Mastelini, Bernhard Pfahringer, and Albert Bifet. 2020. On Ensemble Techniques for Data Stream Regression. In2020 International Joint Conference on Neural Networks (IJCNN). 1––8. doi:10.1109/IJCNN48605.2020.9206756

work page doi:10.1109/ijcnn48605.2020.9206756 2020
[28]

2020.Milliseconds Make Millions

Google and Deloitte. 2020.Milliseconds Make Millions. https://www.deloitte.com/content/dam/assets-zone2/ie/en/docs/services/consulting/2023/Mi lliseconds_Make_Millions_report.pdf

2020
[29]

Mark Grechanik, Chen Fu, and Qing Xie. 2012. Automatically finding performance problems with feedback-directed learning software testing. In 2012 34th International Conference on Software Engineering (ICSE). 156––166. doi:10.1109/ICSE.2012.6227197

work page doi:10.1109/icse.2012.6227197 2012
[30]

2019.Robust-FPOP

guillemr. 2019.Robust-FPOP. https://github.com/guillemr/robust-fpop

2019
[31]

Francis Harchaoui, Zaïd Moulines, Éric, and Bach. 2008. Kernel change-point analysis. InProceedings of the 22nd International Conference on Neural Information Processing Systems(Vancouver, British Columbia, Canada)(NIPS’08). Curran Associates Inc., Red Hook, NY, USA, 609––616. doi:10.5555/2981780.2981856

work page doi:10.5555/2981780.2981856 2008
[32]

Christoph Heger, Jens Happe, and Roozbeh Farahbod. 2013. Automated root cause isolation of performance regressions during software development. InProceedings of the 4th ACM/SPEC International Conference on Performance Engineering(Prague, Czech Republic)(ICPE ’13). Association for Computing Machinery, New York, NY, USA, 27––38. doi:10.1145/2479871.2479879

work page doi:10.1145/2479871.2479879 2013
[33]

Hevner, Salvatore T

Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. 2004. Design science in information systems research.MIS Q.28, 1 (March 2004), 75––105. doi:10.5555/2017212.2017217

work page doi:10.5555/2017212.2017217 2004
[34]

David V. Hinkley. 1970. Inference about the change-point in a sequence of random variables.Biometrika57, 1 (04 1970), 1––17. doi:10.1093/biomet /57.1.1

work page doi:10.1093/biomet 1970
[35]

Husari, Abdulrahman and Taherpour, Sepehr. 2024. Enhanced Techniques for Detecting Performance Abnormalities in Software Quality Assurance Processes. Student Paper

2024
[36]

Henrik Ingo. 2025. 8 Years of Optimizing Apache Otava: How disconnected open source developers took an algorithm from n3 to constant time. arXiv:2505.06758 [cs.PF] https://arxiv.org/abs/2505.06758

arXiv 2025
[37]

James and David S

Nicholas A. James and David S. Matteson. 2015. ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data.Journal of Statistical Software62, 7 (2015), 1––25. doi:10.18637/jss.v062.i07

work page doi:10.18637/jss.v062.i07 2015
[38]

Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2020. Detecting Performance Anomalies in Cloud Platform Applications .IEEE Transactions on Cloud Computing8, 03 (July 2020), 764––777. doi:10.1109/TCC.2018.2808289

work page doi:10.1109/tcc.2018.2808289 2020
[39]

Massey Jr

Frank J. Massey Jr. 1951. The Kolmogorov-Smirnov Test for Goodness of Fit.J. Amer. Statist. Assoc.46, 253 (1951), 68––78. doi:10.1080/01621459.195 1.10500769

work page doi:10.1080/01621459.195 1951
[40]

Iurii Katser, Viacheslav Kozitsin, Victor Lobachev, and Ivan Maksimov. 2021. Unsupervised Offline Changepoint Detection Ensembles.Applied Sciences11, 9 (2021). doi:10.3390/app11094280

work page doi:10.3390/app11094280 2021
[41]

Rebecca Killick and Idris A. Eckley. 2014. changepoint: An R Package for Changepoint Analysis.Journal of Statistical Software58, 3 (2014), 1––19. https://www.jstatsoft.org/article/view/v058i03 Manuscript submitted to ACM Exploring Statistical Change Point Detection Techniques for Performance Anomaly Detection at Mozilla 35

2014
[42]

and Liu, C

R. Killick, P. Fearnhead, and I. A. Eckley. 2012. Optimal Detection of Changepoints With a Linear Computational Cost.J. Amer. Statist. Assoc.107, 500 (2012), 1590––1598. doi:10.1080/01621459.2012.737745

work page doi:10.1080/01621459.2012.737745 2012
[43]

Ambika Kirkland, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely, and Joakim Gustafson. 2023. Stuck in the MOS Pit: A Critical Analysis of MOS Test Methodology in TTS Evaluation. InProceedings of the 12th ISCA Speech Synthesis Workshop (SSW2023). 41––47. doi:10.21437/SSW.2023-7

work page doi:10.21437/ssw.2023-7 2023
[44]

Dagenais, and Mahsa Shakeri

Iman Kohyarnejadfard, Daniel Aloise, Michel R. Dagenais, and Mahsa Shakeri. 2021. A Framework for Detecting System Performance Anomalies Using Tracing Data Analysis.Entropy23, 8 (2021). doi:10.3390/e23081011

work page doi:10.3390/e23081011 2021
[45]

Gunnar Kudrjavets, Jeff Thomas, and Nachiappan Nagappan. 2022. The Evolving Landscape of Software Performance Engineering. InProceedings of the 26th International Conference on Evaluation and Assessment in Software Engineering(Gothenburg, Sweden)(EASE ’22). Association for Computing Machinery, New York, NY, USA, 260––261. doi:10.1145/3530019.3534977

work page doi:10.1145/3530019.3534977 2022
[46]

Christoph Laaber, Joel Scheuner, and Philipp Leitner. 2019. Software Microbenchmarking in the Cloud. How Bad Is It Really?Empirical Software Engineering24, 4 (Aug. 2019), 2469––2508. doi:10.1007/s10664-019-09681-1

work page doi:10.1007/s10664-019-09681-1 2019
[47]

Cha, and Arthur H

Donghun Lee, Sang K. Cha, and Arthur H. Lee. 2012. A Performance Anomaly Detection and Analysis Framework for DBMS Development.IEEE Transactions on Knowledge and Data Engineering24, 8 (2012), 1345–1360. doi:10.1109/TKDE.2011.88

work page doi:10.1109/tkde.2011.88 2012
[48]

Yixiao Li, Gloria Lin, Thomas Lau, and Ruochen Zeng. 2019. A Review of Changepoint Detection Models. arXiv:1908.07136 [stat.ML] https: //arxiv.org/abs/1908.07136

arXiv 2019
[49]

R.A. Likert. 1932. A technique for the measurement of attitudes.Archives of Psychology22, 140 (1932), 5––55

1932
[50]

Xutong Liu, Yufei Zhou, Yutian Tang, Junyan Qian, and Yuming Zhou. 2025. Human-in-the-loop online just-in-time software defect prediction: What have we achieved and what do we still miss?Science of Computer Programming244 (2025), 103296. doi:10.1016/j.scico.2025.103296

work page doi:10.1016/j.scico.2025.103296 2025
[51]

2025.ALIBI Detect

Seldon Technologies Ltd. 2025.ALIBI Detect. https://docs.seldon.io/projects/alibi-detect/en/latest/index.html

2025
[52]

Lucas and Michael S

James M. Lucas and Michael S. Saccucci. 1990. Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements. Technometrics32, 1 (1990), 1––12. doi:10.1080/00401706.1990.10484583

work page doi:10.1080/00401706.1990.10484583 1990
[53]

Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, and Lars Hoemke. 2024. BIPeC: A Combined Change-Point Analyzer to Identify Performance Regressions in Large-scale Database Systems. In2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). 808––819. doi:10.1109/ICSME58944.2024.00084

work page doi:10.1109/icsme58944.2024.00084 2024
[54]

H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other.The Annals of Mathematical Statistics18, 1 (1947), 50––60. doi:10.1214/aoms/1177730491

work page doi:10.1214/aoms/1177730491 1947
[55]

Aleksander Maricq, Dmitry Duplyakin, Ivo Jimenez, Carlos Maltzahn, Ryan Stutsman, and Robert Ricci. 2018. Taming performance variability. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(Carlsbad, CA, USA)(OSDI’18). USENIX Association, USA, 409––425. doi:10.5555/3291168.3291198

work page doi:10.5555/3291168.3291198 2018
[56]

Florian Markusse, Philipp Leitner, and Alexander Serebrenik. 2022. Using Benchmarking Bots for Continuous Performance Assessment.IEEE Softw. 39, 5 (2022), 50––55. doi:10.1109/MS.2022.3184430

work page doi:10.1109/ms.2022.3184430 2022
[57]

Martin, C.C

D.R. Martin, C.C. Fowlkes, and J. Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues.IEEE Transactions on Pattern Analysis and Machine Intelligence26, 5 (2004), 530––549. doi:10.1109/TPAMI.2004.1273918

work page doi:10.1109/tpami.2004.1273918 2004
[58]

Journal of the American Statistical Association , author =

David S. Matteson and Nicholas A. James. 2014. A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data.J. Amer. Statist. Assoc.109, 505 (2014), 334––345. doi:10.1080/01621459.2013.849605

work page doi:10.1080/01621459.2013.849605 2014
[59]

McBride, Wendy A

Sara E. McBride, Wendy A. Rogers, and Arthur D. Fisk. 2014. Understanding human management of automation errors.Theoretical Issues in Ergonomics Science15, 6 (2014), 545––577. doi:10.1080/1463922X.2013.817625 PMID: 25383042

work page doi:10.1080/1463922x.2013.817625 2014
[60]

2024.E Divisive CPD Method

MongoDB. 2024.E Divisive CPD Method. https://pypi.org/project/signal-processing-algorithms/

2024
[61]

Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. 2021. River: machine learning for streaming data in Python.J. Mach. Learn. Res.22, 1, Article 110 (Jan. 2021), 8 pages. doi:10.5555/3546258.3546368

work page doi:10.5555/3546258.3546368 2021
[62]

2026.Autoland code repository

Mozilla. 2026.Autoland code repository. https://hg.mozilla.org/integration/autoland/

2026
[63]

2026.Mozilla’s Perfherder dashboard

Mozilla. 2026.Mozilla’s Perfherder dashboard. https://treeherder.mozilla.org/perfherder/alerts/

2026
[64]

2026.Replication Package

Mozilla. 2026.Replication Package. https://github.com/mozilla/REALISE-Performance

2026
[65]

MUIR and NEVILLE MORAY

BONNIE M. MUIR and NEVILLE MORAY. 1996. Trust in automation. Part II. Experimental studies of trust and human intervention in a process control simulation.Ergonomics39, 3 (1996), 429––460. doi:10.1080/00140139608964474 PMID: 8849495

work page doi:10.1080/00140139608964474 1996
[66]

Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2009. Producing wrong data without doing anything obviously wrong!. InProceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems(Washington, DC, USA)(ASPLOS XIV). Association for Computing Machinery, New York, NY, USA, 265––...

work page doi:10.1145/1508244.1508275 2009
[67]

Stefan Mühlbauer, Sven Apel, and Norbert Siegmund. 2019. Accurate Modeling of Performance Histories for Evolving Software Systems. In2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 640––652. doi:10.1109/ASE.2019.00065

work page doi:10.1109/ase.2019.00065 2019
[68]

Adrian Nistor, Po-Chun Chang, Cosmin Radoi, and Shan Lu. 2015. CARAMEL: Detecting and Fixing Performance Problems That Have Non-Intrusive Fixes. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 902––912. doi:10.1109/ICSE.2015.100

work page doi:10.1109/icse.2015.100 2015
[69]

Oscar Hernan Madrid Padilla, Alex Athey, Alex Reinhart, and James G. Scott. 2019. Sequential Nonparametric Tests for a Change in Distribution: An Application to Detecting Radiological Anomalies.J. Amer. Statist. Assoc.114, 526 (2019), 514––528. doi:10.1080/01621459.2018.1476245

work page doi:10.1080/01621459.2018.1476245 2019
[70]

E. S. Page. 1954. Continuous Inspection Schemes.Biometrika41, 1/2 (1954), 100––115. doi:10.2307/2333009 Manuscript submitted to ACM 36 Besbes, Mierzwinski, Mujahid, Leitner, Serebrenik, Hunt, and Costa

work page doi:10.2307/2333009 1954
[71]

2019.ocp: Bayesian Online Changepoint Detection

Andrea Pagotto. 2019.ocp: Bayesian Online Changepoint Detection. https://CRAN.R-project.org/package=ocp

2019
[72]

Shravan Pargaonkar. 2023. A Comprehensive Review of Performance Testing Methodologies and Best Practices: Software Quality Engineering. International Journal of Science and Research12, 8 (2023), 2008––2014. doi:10.21275/SR23822111402

work page doi:10.21275/sr23822111402 2023
[73]

Christoph Raab, Moritz Heusinger, and Frank-Michael Schleif. 2020. Reactive Soft Prototype Computing for Concept Drift Streams.Neurocomputing 416 (2020), 340––351. doi:10.1016/j.neucom.2019.11.111

work page doi:10.1016/j.neucom.2019.11.111 2020
[74]

Raghu Ramakrishnan and Arvinder Kaur. 2017. Technique for Detecting Early-Warning Signals of Performance Deterioration in Large Scale Software Systems. InProceedings of the 8th ACM/SPEC on International Conference on Performance Engineering(L’Aquila, Italy)(ICPE ’17). Association for Computing Machinery, New York, NY, USA, 213––222. doi:10.1145/3030207.3044533

work page doi:10.1145/3030207.3044533 2017
[75]

David Georg Reichelt and Stefan Kühne. 2018. How to Detect Performance Changes in Software History: Performance Analysis of Software System Versions. InCompanion of the 2018 ACM/SPEC International Conference on Performance Engineering(Berlin, Germany)(ICPE ’18). Association for Computing Machinery, New York, NY, USA, 183––188. doi:10.1145/3185768.3186404

work page doi:10.1145/3185768.3186404 2018
[76]

Robillard, Deeksha M

Martin P. Robillard, Deeksha M. Arya, Neil A. Ernst, Jin L. C. Guo, Maxime Lamothe, Mathieu Nassif, Nicole Novielli, Alexander Serebrenik, Igor Steinmacher, and Klaas-Jan Stol. 2024. Communicating Study Design Trade-offs in Software Engineering.ACM Trans. Softw. Eng. Methodol.33, 5 (2024), 112:1––112:10. doi:10.1145/3649598

work page doi:10.1145/3649598 2024
[77]

Ross and Niall M

Gordon J. Ross and Niall M. Adams. 2012. Two Nonparametric Control Charts for Detecting Arbitrary Distribution Changes.Journal of Quality Technology44, 2 (2012), 102––116. doi:10.1080/00224065.2012.11917887

work page doi:10.1080/00224065.2012.11917887 2012
[78]

2025.Cramér-von Mises test implementation in scipy.stats

SciPy. 2025.Cramér-von Mises test implementation in scipy.stats. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cramervonmises_ 2samp.html

2025
[79]

2025.Kolmogorov-Smirnov implementation in scipy.stats

SciPy. 2025.Kolmogorov-Smirnov implementation in scipy.stats. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

2025
[80]

2025.Levene test implementation in scipy.stats

SciPy. 2025.Levene test implementation in scipy.stats. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html

2025

Showing first 80 references.

[1] [1]

Ryan Prescott Adams and David J. C. MacKay. 2007. Bayesian Online Changepoint Detection. arXiv:0710.3742 [stat.ML] https://arxiv.org/abs/0710.3 742

Pith/arXiv arXiv 2007

[2] [2]

Mejbah Alam, Justin Gottschlich, Nesime Tatbul, Javier Turek, Timothy Mattson, and Abdullah Muzahid. 2019. A zero-positive learning approach for diagnosing software performance regressions. InAdvances in Neural Information Processing Systems, Vol. 32. Curran Associates Inc., Red Hook, NY, USA, Article 1043, 13 pages. doi:10.5555/3454287.3455330

work page doi:10.5555/3454287.3455330 2019

[3] [3]

Samaneh Aminikhanghahi and Diane Cook. 2017. A Survey of Methods for Time Series Change Point Detection.Knowledge and Information Systems 51, 2 (05 2017), 339––367. doi:10.1007/s10115-016-0987-z

work page doi:10.1007/s10115-016-0987-z 2017

[4] [4]

Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. InProceedings of the 33rd International Conference on Software Engineering(Waikiki, Honolulu, HI, USA)(ICSE ’11). Association for Computing Machinery, New York, NY, USA, 1––10. doi:10.1145/1985793.1985795

work page doi:10.1145/1985793.1985795 2011

[5] [5]

Zahra Atashgahi, Decebal Constantin Mocanu, Raymond Veldhuis, and Mykola Pechenizkiy. 2022. Memory-free Online Change-point Detection: A Novel Neural Network Approach. arXiv:2207.03932 [cs.LG] https://arxiv.org/abs/2207.03932

arXiv 2022

[6] [6]

Auger and Charles E

Ivan E. Auger and Charles E. Lawrence. 1989. Algorithms for the Optimal Identification of Segment Neighborhoods.Bulletin of Mathematical Biology51, 1 (Jan. 1989), 39––54. doi:10.1007/BF02458835

work page doi:10.1007/bf02458835 1989

[7] [7]

Sayandeep Banerjee, Bappa Basak, Sandip Mandal, Subhajit Manna, Shuvam Chakraborty, Sujata Ghatak, and Anirban Das. 2024. Real Time Anomaly Detection in Network Traffic: A Comparative Analysis of Machine Learning.International Research Journal on Advanced Engineering Hub (IRJAEH)2, 07 (Jul. 2024), 1968––1977. doi:10.47392/IRJAEH.2024.0269

work page doi:10.47392/irjaeh.2024.0269 2024

[8] [8]

2024.WBS R Package

Rafal Baranowski and Piotr Fryzlewicz. 2024.WBS R Package. https://cran.r-project.org/web/packages/wbs/index.html

2024

[9] [9]

Jonathan Bell and Gail Kaiser. 2014. Unit test virtualization with VMVM. InProceedings of the 36th International Conference on Software Engineering (Hyderabad, India)(ICSE 2014). Association for Computing Machinery, New York, NY, USA, 550––561. doi:10.1145/2568225.2568248

work page doi:10.1145/2568225.2568248 2014

[10] [10]

Moritz Beller, Hongyu Li, Vivek Nair, Vijayaraghavan Murali, Imad Ahmad, Jürgen Cito, Drew Carlson, Ari Aye, and Wes Dyer. 2023. Learning to Learn to Predict Performance Regressions in Production at Meta. In2023 IEEE/ACM International Conference on Automation of Software Test (AST). 56––67. doi:10.1109/AST58925.2023.00010

work page doi:10.1109/ast58925.2023.00010 2023

[11] [11]

Mohamed Bilel Besbes, Diego Elias Costa, Suhaib Mujahid, Gregory Mierzwinski, and Marco Castelluccio. 2025. A Dataset of Performance Measurements and Alerts from Mozilla (Data Artifact). InCompanion of the 16th ACM/SPEC International Conference on Performance Engineering (Toronto ON, Canada)(ICPE ’25). Association for Computing Machinery, New York, NY, US...

work page doi:10.1145/3680256.3721973 2025

[12] [12]

Albert Bifet and Ricard Gavaldà. 2007. Learning from Time-Changing Data with Adaptive Windowing. InProceedings of the 2007 SIAM International Conference on Data Mining. 443–448. doi:10.1137/1.9781611972771.42

work page doi:10.1137/1.9781611972771.42 2007

[13] [13]

Hampel , title =

Morton B. Brown and Alan B. Forsythe. 1974. Robust Tests for the Equality of Variances.J. Amer. Statist. Assoc.69, 346 (1974), 364––367. doi:10.1080/01621459.1974.10482955

work page doi:10.1080/01621459.1974.10482955 1974

[14] [14]

Jinfu Chen and Weiyi Shang. 2017. An Exploratory Study of Performance Regression Introducing Code Changes. In2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). 341––352. doi:10.1109/ICSME.2017.13

work page doi:10.1109/icsme.2017.13 2017

[15] [15]

Jinfu Chen, Weiyi Shang, and Emad Shihab. 2022. PerfJIT: Test-Level Just-in-Time Prediction for Performance Regression Introducing Commits. IEEE Transactions on Software Engineering48, 5 (2022), 1529––1544. doi:10.1109/TSE.2020.3023955 Manuscript submitted to ACM 34 Besbes, Mierzwinski, Mujahid, Leitner, Serebrenik, Hunt, and Costa

work page doi:10.1109/tse.2020.3023955 2022

[16] [16]

Zhuangbin Chen, Jinyang Liu, Yuxin Su, Hongyu Zhang, Xiao Ling, Yongqiang Yang, and Michael R. Lyu. 2022. Adaptive performance anomaly detection for online service systems via pattern sketching. InProceedings of the 44th International Conference on Software Engineering(Pittsburgh, Pennsylvania)(ICSE ’22). Association for Computing Machinery, New York, NY,...

work page doi:10.1145/3510003.3510085 2022

[17] [17]

2012.How One Second Could Cost Amazon $1.6 Billion In Sales

Fast Company. 2012.How One Second Could Cost Amazon $1.6 Billion In Sales. https://www.fastcompany.com/1825005/how-one-second-could-cost- amazon-16-billion-sales

arXiv 2012

[18] [18]

Diego Costa, Cor-Paul Bezemer, Philipp Leitner, and Artur Andrzejak. 2021. What’s Wrong with My Benchmark Results? Studying Bad Practices in JMH Benchmarks.IEEE Transactions on Software Engineering47, 7 (2021), 1452––1467. doi:10.1109/TSE.2019.2925345

work page doi:10.1109/tse.2019.2925345 2021

[19] [19]

Harald Cramér. 1928. On the composition of elementary errors.Scandinavian Actuarial Journal1928, 1 (1928), 13––74. doi:10.1080/03461238.1928.10 416862

work page doi:10.1080/03461238.1928.10 1928

[20] [20]

David Daly. 2021. Creating a Virtuous Cycle in Performance Testing at MongoDB. InProceedings of the ACM/SPEC International Conference on Performance Engineering(Virtual Event, France)(ICPE ’21). Association for Computing Machinery, New York, NY, USA, 33––41. doi:10.1145/342792 1.3450234

work page doi:10.1145/342792 2021

[21] [21]

David Daly, William Brown, Henrik Ingo, Jim O’Leary, and David Bradford. 2020. The Use of Change Point Detection to Identify Software Performance Regressions in a Continuous Integration System. InProceedings of the ACM/SPEC International Conference on Performance Engineering (Edmonton AB, Canada)(ICPE ’20). Association for Computing Machinery, New York, N...

work page doi:10.1145/3358960.3375791 2020

[22] [22]

Janez Demšar. 2006. Statistical Comparisons of Classifiers over Multiple Data Sets.J. Mach. Learn. Res.7 (Dec. 2006), 1––30. doi:10.5555/1248547.12 48548

work page doi:10.5555/1248547.12 2006

[23] [23]

2022.TCPDBench

Simon Eismann. 2022.TCPDBench. https://github.com/SimonEismann/TCPDBench/tree/master

2022

[24] [24]

Emelie Engström, Margaret-Anne Storey, Per Runeson, Martin Höst, and Maria Teresa Baldassarre. 2020. How software engineering research aligns with design science: a review.Empirical Software Engineering25, 4 (July 2020), 2630––2660. doi:10.1007/s10664-020-09818-7

work page doi:10.1007/s10664-020-09818-7 2020

[25] [25]

Paul Fearnhead and Guillem Rigaill. 2019. Changepoint Detection in the Presence of Outliers.J. Amer. Statist. Assoc.114, 525 (2019), 169––183. doi:10.1080/01621459.2017.1385466

work page doi:10.1080/01621459.2017.1385466 2019

[26] [26]

Piotr Fryzlewicz. 2014. WILD BINARY SEGMENTATION FOR MULTIPLE CHANGE-POINT DETECTION.The Annals of Statistics42, 6 (2014), 2243––2281. doi:10.1214/14-AOS1245

work page doi:10.1214/14-aos1245 2014

[27] [27]

Heitor Murilo Gomes, Jacob Montiel, Saulo Martiello Mastelini, Bernhard Pfahringer, and Albert Bifet. 2020. On Ensemble Techniques for Data Stream Regression. In2020 International Joint Conference on Neural Networks (IJCNN). 1––8. doi:10.1109/IJCNN48605.2020.9206756

work page doi:10.1109/ijcnn48605.2020.9206756 2020

[28] [28]

2020.Milliseconds Make Millions

Google and Deloitte. 2020.Milliseconds Make Millions. https://www.deloitte.com/content/dam/assets-zone2/ie/en/docs/services/consulting/2023/Mi lliseconds_Make_Millions_report.pdf

2020

[29] [29]

Mark Grechanik, Chen Fu, and Qing Xie. 2012. Automatically finding performance problems with feedback-directed learning software testing. In 2012 34th International Conference on Software Engineering (ICSE). 156––166. doi:10.1109/ICSE.2012.6227197

work page doi:10.1109/icse.2012.6227197 2012

[30] [30]

2019.Robust-FPOP

guillemr. 2019.Robust-FPOP. https://github.com/guillemr/robust-fpop

2019

[31] [31]

Francis Harchaoui, Zaïd Moulines, Éric, and Bach. 2008. Kernel change-point analysis. InProceedings of the 22nd International Conference on Neural Information Processing Systems(Vancouver, British Columbia, Canada)(NIPS’08). Curran Associates Inc., Red Hook, NY, USA, 609––616. doi:10.5555/2981780.2981856

work page doi:10.5555/2981780.2981856 2008

[32] [32]

Christoph Heger, Jens Happe, and Roozbeh Farahbod. 2013. Automated root cause isolation of performance regressions during software development. InProceedings of the 4th ACM/SPEC International Conference on Performance Engineering(Prague, Czech Republic)(ICPE ’13). Association for Computing Machinery, New York, NY, USA, 27––38. doi:10.1145/2479871.2479879

work page doi:10.1145/2479871.2479879 2013

[33] [33]

Hevner, Salvatore T

Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. 2004. Design science in information systems research.MIS Q.28, 1 (March 2004), 75––105. doi:10.5555/2017212.2017217

work page doi:10.5555/2017212.2017217 2004

[34] [34]

David V. Hinkley. 1970. Inference about the change-point in a sequence of random variables.Biometrika57, 1 (04 1970), 1––17. doi:10.1093/biomet /57.1.1

work page doi:10.1093/biomet 1970

[35] [35]

Husari, Abdulrahman and Taherpour, Sepehr. 2024. Enhanced Techniques for Detecting Performance Abnormalities in Software Quality Assurance Processes. Student Paper

2024

[36] [36]

Henrik Ingo. 2025. 8 Years of Optimizing Apache Otava: How disconnected open source developers took an algorithm from n3 to constant time. arXiv:2505.06758 [cs.PF] https://arxiv.org/abs/2505.06758

arXiv 2025

[37] [37]

James and David S

Nicholas A. James and David S. Matteson. 2015. ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data.Journal of Statistical Software62, 7 (2015), 1––25. doi:10.18637/jss.v062.i07

work page doi:10.18637/jss.v062.i07 2015

[38] [38]

Hiranya Jayathilaka, Chandra Krintz, and Rich Wolski. 2020. Detecting Performance Anomalies in Cloud Platform Applications .IEEE Transactions on Cloud Computing8, 03 (July 2020), 764––777. doi:10.1109/TCC.2018.2808289

work page doi:10.1109/tcc.2018.2808289 2020

[39] [39]

Massey Jr

Frank J. Massey Jr. 1951. The Kolmogorov-Smirnov Test for Goodness of Fit.J. Amer. Statist. Assoc.46, 253 (1951), 68––78. doi:10.1080/01621459.195 1.10500769

work page doi:10.1080/01621459.195 1951

[40] [40]

Iurii Katser, Viacheslav Kozitsin, Victor Lobachev, and Ivan Maksimov. 2021. Unsupervised Offline Changepoint Detection Ensembles.Applied Sciences11, 9 (2021). doi:10.3390/app11094280

work page doi:10.3390/app11094280 2021

[41] [41]

Rebecca Killick and Idris A. Eckley. 2014. changepoint: An R Package for Changepoint Analysis.Journal of Statistical Software58, 3 (2014), 1––19. https://www.jstatsoft.org/article/view/v058i03 Manuscript submitted to ACM Exploring Statistical Change Point Detection Techniques for Performance Anomaly Detection at Mozilla 35

2014

[42] [42]

and Liu, C

R. Killick, P. Fearnhead, and I. A. Eckley. 2012. Optimal Detection of Changepoints With a Linear Computational Cost.J. Amer. Statist. Assoc.107, 500 (2012), 1590––1598. doi:10.1080/01621459.2012.737745

work page doi:10.1080/01621459.2012.737745 2012

[43] [43]

Ambika Kirkland, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely, and Joakim Gustafson. 2023. Stuck in the MOS Pit: A Critical Analysis of MOS Test Methodology in TTS Evaluation. InProceedings of the 12th ISCA Speech Synthesis Workshop (SSW2023). 41––47. doi:10.21437/SSW.2023-7

work page doi:10.21437/ssw.2023-7 2023

[44] [44]

Dagenais, and Mahsa Shakeri

Iman Kohyarnejadfard, Daniel Aloise, Michel R. Dagenais, and Mahsa Shakeri. 2021. A Framework for Detecting System Performance Anomalies Using Tracing Data Analysis.Entropy23, 8 (2021). doi:10.3390/e23081011

work page doi:10.3390/e23081011 2021

[45] [45]

Gunnar Kudrjavets, Jeff Thomas, and Nachiappan Nagappan. 2022. The Evolving Landscape of Software Performance Engineering. InProceedings of the 26th International Conference on Evaluation and Assessment in Software Engineering(Gothenburg, Sweden)(EASE ’22). Association for Computing Machinery, New York, NY, USA, 260––261. doi:10.1145/3530019.3534977

work page doi:10.1145/3530019.3534977 2022

[46] [46]

Christoph Laaber, Joel Scheuner, and Philipp Leitner. 2019. Software Microbenchmarking in the Cloud. How Bad Is It Really?Empirical Software Engineering24, 4 (Aug. 2019), 2469––2508. doi:10.1007/s10664-019-09681-1

work page doi:10.1007/s10664-019-09681-1 2019

[47] [47]

Cha, and Arthur H

Donghun Lee, Sang K. Cha, and Arthur H. Lee. 2012. A Performance Anomaly Detection and Analysis Framework for DBMS Development.IEEE Transactions on Knowledge and Data Engineering24, 8 (2012), 1345–1360. doi:10.1109/TKDE.2011.88

work page doi:10.1109/tkde.2011.88 2012

[48] [48]

Yixiao Li, Gloria Lin, Thomas Lau, and Ruochen Zeng. 2019. A Review of Changepoint Detection Models. arXiv:1908.07136 [stat.ML] https: //arxiv.org/abs/1908.07136

arXiv 2019

[49] [49]

R.A. Likert. 1932. A technique for the measurement of attitudes.Archives of Psychology22, 140 (1932), 5––55

1932

[50] [50]

Xutong Liu, Yufei Zhou, Yutian Tang, Junyan Qian, and Yuming Zhou. 2025. Human-in-the-loop online just-in-time software defect prediction: What have we achieved and what do we still miss?Science of Computer Programming244 (2025), 103296. doi:10.1016/j.scico.2025.103296

work page doi:10.1016/j.scico.2025.103296 2025

[51] [51]

2025.ALIBI Detect

Seldon Technologies Ltd. 2025.ALIBI Detect. https://docs.seldon.io/projects/alibi-detect/en/latest/index.html

2025

[52] [52]

Lucas and Michael S

James M. Lucas and Michael S. Saccucci. 1990. Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements. Technometrics32, 1 (1990), 1––12. doi:10.1080/00401706.1990.10484583

work page doi:10.1080/00401706.1990.10484583 1990

[53] [53]

Zhan Lyu, Thomas Bach, Yong Li, Nguyen Minh Le, and Lars Hoemke. 2024. BIPeC: A Combined Change-Point Analyzer to Identify Performance Regressions in Large-scale Database Systems. In2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). 808––819. doi:10.1109/ICSME58944.2024.00084

work page doi:10.1109/icsme58944.2024.00084 2024

[54] [54]

H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other.The Annals of Mathematical Statistics18, 1 (1947), 50––60. doi:10.1214/aoms/1177730491

work page doi:10.1214/aoms/1177730491 1947

[55] [55]

Aleksander Maricq, Dmitry Duplyakin, Ivo Jimenez, Carlos Maltzahn, Ryan Stutsman, and Robert Ricci. 2018. Taming performance variability. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation(Carlsbad, CA, USA)(OSDI’18). USENIX Association, USA, 409––425. doi:10.5555/3291168.3291198

work page doi:10.5555/3291168.3291198 2018

[56] [56]

Florian Markusse, Philipp Leitner, and Alexander Serebrenik. 2022. Using Benchmarking Bots for Continuous Performance Assessment.IEEE Softw. 39, 5 (2022), 50––55. doi:10.1109/MS.2022.3184430

work page doi:10.1109/ms.2022.3184430 2022

[57] [57]

Martin, C.C

D.R. Martin, C.C. Fowlkes, and J. Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues.IEEE Transactions on Pattern Analysis and Machine Intelligence26, 5 (2004), 530––549. doi:10.1109/TPAMI.2004.1273918

work page doi:10.1109/tpami.2004.1273918 2004

[58] [58]

Journal of the American Statistical Association , author =

David S. Matteson and Nicholas A. James. 2014. A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data.J. Amer. Statist. Assoc.109, 505 (2014), 334––345. doi:10.1080/01621459.2013.849605

work page doi:10.1080/01621459.2013.849605 2014

[59] [59]

McBride, Wendy A

Sara E. McBride, Wendy A. Rogers, and Arthur D. Fisk. 2014. Understanding human management of automation errors.Theoretical Issues in Ergonomics Science15, 6 (2014), 545––577. doi:10.1080/1463922X.2013.817625 PMID: 25383042

work page doi:10.1080/1463922x.2013.817625 2014

[60] [60]

2024.E Divisive CPD Method

MongoDB. 2024.E Divisive CPD Method. https://pypi.org/project/signal-processing-algorithms/

2024

[61] [61]

Jacob Montiel, Max Halford, Saulo Martiello Mastelini, Geoffrey Bolmier, Raphael Sourty, Robin Vaysse, Adil Zouitine, Heitor Murilo Gomes, Jesse Read, Talel Abdessalem, and Albert Bifet. 2021. River: machine learning for streaming data in Python.J. Mach. Learn. Res.22, 1, Article 110 (Jan. 2021), 8 pages. doi:10.5555/3546258.3546368

work page doi:10.5555/3546258.3546368 2021

[62] [62]

2026.Autoland code repository

Mozilla. 2026.Autoland code repository. https://hg.mozilla.org/integration/autoland/

2026

[63] [63]

2026.Mozilla’s Perfherder dashboard

Mozilla. 2026.Mozilla’s Perfherder dashboard. https://treeherder.mozilla.org/perfherder/alerts/

2026

[64] [64]

2026.Replication Package

Mozilla. 2026.Replication Package. https://github.com/mozilla/REALISE-Performance

2026

[65] [65]

MUIR and NEVILLE MORAY

BONNIE M. MUIR and NEVILLE MORAY. 1996. Trust in automation. Part II. Experimental studies of trust and human intervention in a process control simulation.Ergonomics39, 3 (1996), 429––460. doi:10.1080/00140139608964474 PMID: 8849495

work page doi:10.1080/00140139608964474 1996

[66] [66]

Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. 2009. Producing wrong data without doing anything obviously wrong!. InProceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems(Washington, DC, USA)(ASPLOS XIV). Association for Computing Machinery, New York, NY, USA, 265––...

work page doi:10.1145/1508244.1508275 2009

[67] [67]

Stefan Mühlbauer, Sven Apel, and Norbert Siegmund. 2019. Accurate Modeling of Performance Histories for Evolving Software Systems. In2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 640––652. doi:10.1109/ASE.2019.00065

work page doi:10.1109/ase.2019.00065 2019

[68] [68]

Adrian Nistor, Po-Chun Chang, Cosmin Radoi, and Shan Lu. 2015. CARAMEL: Detecting and Fixing Performance Problems That Have Non-Intrusive Fixes. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 902––912. doi:10.1109/ICSE.2015.100

work page doi:10.1109/icse.2015.100 2015

[69] [69]

Oscar Hernan Madrid Padilla, Alex Athey, Alex Reinhart, and James G. Scott. 2019. Sequential Nonparametric Tests for a Change in Distribution: An Application to Detecting Radiological Anomalies.J. Amer. Statist. Assoc.114, 526 (2019), 514––528. doi:10.1080/01621459.2018.1476245

work page doi:10.1080/01621459.2018.1476245 2019

[70] [70]

E. S. Page. 1954. Continuous Inspection Schemes.Biometrika41, 1/2 (1954), 100––115. doi:10.2307/2333009 Manuscript submitted to ACM 36 Besbes, Mierzwinski, Mujahid, Leitner, Serebrenik, Hunt, and Costa

work page doi:10.2307/2333009 1954

[71] [71]

2019.ocp: Bayesian Online Changepoint Detection

Andrea Pagotto. 2019.ocp: Bayesian Online Changepoint Detection. https://CRAN.R-project.org/package=ocp

2019

[72] [72]

Shravan Pargaonkar. 2023. A Comprehensive Review of Performance Testing Methodologies and Best Practices: Software Quality Engineering. International Journal of Science and Research12, 8 (2023), 2008––2014. doi:10.21275/SR23822111402

work page doi:10.21275/sr23822111402 2023

[73] [73]

Christoph Raab, Moritz Heusinger, and Frank-Michael Schleif. 2020. Reactive Soft Prototype Computing for Concept Drift Streams.Neurocomputing 416 (2020), 340––351. doi:10.1016/j.neucom.2019.11.111

work page doi:10.1016/j.neucom.2019.11.111 2020

[74] [74]

Raghu Ramakrishnan and Arvinder Kaur. 2017. Technique for Detecting Early-Warning Signals of Performance Deterioration in Large Scale Software Systems. InProceedings of the 8th ACM/SPEC on International Conference on Performance Engineering(L’Aquila, Italy)(ICPE ’17). Association for Computing Machinery, New York, NY, USA, 213––222. doi:10.1145/3030207.3044533

work page doi:10.1145/3030207.3044533 2017

[75] [75]

David Georg Reichelt and Stefan Kühne. 2018. How to Detect Performance Changes in Software History: Performance Analysis of Software System Versions. InCompanion of the 2018 ACM/SPEC International Conference on Performance Engineering(Berlin, Germany)(ICPE ’18). Association for Computing Machinery, New York, NY, USA, 183––188. doi:10.1145/3185768.3186404

work page doi:10.1145/3185768.3186404 2018

[76] [76]

Robillard, Deeksha M

Martin P. Robillard, Deeksha M. Arya, Neil A. Ernst, Jin L. C. Guo, Maxime Lamothe, Mathieu Nassif, Nicole Novielli, Alexander Serebrenik, Igor Steinmacher, and Klaas-Jan Stol. 2024. Communicating Study Design Trade-offs in Software Engineering.ACM Trans. Softw. Eng. Methodol.33, 5 (2024), 112:1––112:10. doi:10.1145/3649598

work page doi:10.1145/3649598 2024

[77] [77]

Ross and Niall M

Gordon J. Ross and Niall M. Adams. 2012. Two Nonparametric Control Charts for Detecting Arbitrary Distribution Changes.Journal of Quality Technology44, 2 (2012), 102––116. doi:10.1080/00224065.2012.11917887

work page doi:10.1080/00224065.2012.11917887 2012

[78] [78]

2025.Cramér-von Mises test implementation in scipy.stats

SciPy. 2025.Cramér-von Mises test implementation in scipy.stats. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cramervonmises_ 2samp.html

2025

[79] [79]

2025.Kolmogorov-Smirnov implementation in scipy.stats

SciPy. 2025.Kolmogorov-Smirnov implementation in scipy.stats. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

2025

[80] [80]

2025.Levene test implementation in scipy.stats

SciPy. 2025.Levene test implementation in scipy.stats. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html

2025