The Right Call for Software Benchmarking: Consistent Decisions in Stateful Environments

G\'abor Melis

arxiv: 2606.17261 · v1 · pith:LMOEY536new · submitted 2026-06-15 · 💻 cs.PF · cs.SE· stat.AP

The Right Call for Software Benchmarking: Consistent Decisions in Stateful Environments

G\'abor Melis This is my paper

Pith reviewed 2026-06-27 02:21 UTC · model grok-4.3

classification 💻 cs.PF cs.SEstat.AP

keywords software benchmarkingstateful systemsperformance contrastsconsistent estimatorsexperiment designadaptive mechanismsfinite budget

0 comments

The pith

Benchmarking can identify the fastest program consistently by estimating performance contrasts that cancel biases in stateful environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that benchmarking in stateful computing systems is confounded by adaptive mechanisms introducing temporal dependencies, making absolute performance measures biased. It argues for prioritizing performance differentials and proposes experiment designs with consistent estimators of contrasts where program-specific biases cancel under tenable assumptions about system dynamics. This approach asymptotically yields the correct decision on which program is fastest, providing a robust methodology for finite-budget benchmarking without needing to model the dynamics explicitly. A reader would care because it addresses a practical problem in optimizing performance-sensitive software where naive methods fail due to unknown system behaviors.

Core claim

By formalizing software benchmarking as the decision problem of identifying the fastest program, for which relative knowledge suffices, the paper proposes simple experiment designs admitting consistent estimators of contrasts. These designs ensure that program-specific biases cancel under tenable assumptions, asymptotically yielding the correct decision and affording a robust methodology for finite-budget benchmarking in stateful environments.

What carries the argument

Consistent estimators of contrasts in simple experiment designs, which cancel program-specific biases without explicit modeling of system dynamics.

If this is right

These designs asymptotically yield the correct decision about which program is fastest.
They afford a robust methodology for finite-budget benchmarking in stateful environments.
Broad implications for the development of performance-sensitive software follow from the focus on relative knowledge.
Naive estimators of individual program performance are biased due to temporal dependencies from adaptive mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such designs might be adapted to benchmarking in other adaptive systems like machine learning training environments.
Future work could test these designs on real hardware with known stateful behaviors to verify bias cancellation.
Emphasizing contrasts could lead to new standards in performance evaluation that avoid absolute measures altogether.

Load-bearing premise

The experiment designs make program-specific biases cancel under tenable assumptions about system dynamics.

What would settle it

A counterexample where the proposed designs lead to an incorrect decision on program speed in a controlled stateful environment with measurable biases that do not cancel.

Figures

Figures reproduced from arXiv: 2606.17261 by G\'abor Melis.

**Figure 1.** Figure 1: The general generative model (Definition 1) of sequential measurements in the presence of uncontrolled state [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The delta design (Definition 17). This constrains the configuration model (Figure 1) but leaves the sampling strategy for [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: The blocked experiment design 𝛽( [2, 4, . . . ] ). The individual blocks are delta designs (§6) linked via their last and first uncontrolled states. Run counts of programs within blocks are balanced. The balancing couples 𝐹𝑖 and 𝐹𝑗 in the same block, but those in different blocks are independent. M𝑐 = max 𝑇 | 𝑐 [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: DP estimates of the worst-case confidence [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Dynamic programming estimates of the worst-case [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Dynamic programming estimates of the worst-case [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Estimates of the worst-case confidence 𝛾 with the detection threshold 𝜔 set to 0.5M𝑐 . The dynamic programming approximations, 𝛼 DP10 (§8.2.1) and 𝛽 DP10 (§7.6), are with 10 quanta (Q = 10). The Asymmetric and the Martingale approximations for the 𝛼-design are discussed in §8.2.2. 50 100 150 200 250 0 0.2 0.4 0.6 0.8 1 𝑛 𝛾 𝛼 DP10 𝛼 Asymmetric 𝛼 Martingale 𝛽 DP10 [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: With the lower detection threshold of 𝜔 = 0.25M𝑐 , it becomes apparent that the 𝛽-design needs a significant number of samples to achieve non-zero confidence. 50 100 150 200 250 0 0.2 0.4 0.6 𝑛 𝛾 𝛼 DP10 𝛼 Asymmetric 𝛼 Martingale 𝛽 DP10 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: We halve the detection threshold in Figure 8 to [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 13.** Figure 13: Estimates of the worst-case confidence 𝛾 for 𝐾 = 4. 20 40 60 80 100 120 140 0.92 0.94 0.96 0.98 1 𝑛 𝛾 𝛼 DP10 𝛼 Asymmetric 𝛼 Martingale 𝛽 DP10 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Estimates of the worst-case confidence 𝛾 for 𝐾 = 5. 20 40 60 80 100 120 140 0.98 0.99 1 𝑛 𝛾 𝛼 DP10 𝛼 Asymmetric 𝛼 Martingale 𝛽 DP10 [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Estimates of the worst-case confidence 𝛾 for 𝐾 = 6. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 12.** Figure 12: Estimates of the worst-case confidence 𝛾 for 𝐾 = 3. 20 40 60 80 100 120 140 0.8 0.9 1 𝑛 𝛾 𝛼 DP10 𝛼 Asymmetric 𝛼 Martingale 𝛽 DP10 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

read the original abstract

In the perpetual pursuit of performance, modern computing systems rely ever more on stateful mechanisms to accommodate the dynamics of workloads and physical environments, bolstering efficiency but confounding benchmarking and thereby the optimization of software. Indeed, by their nature, adaptive mechanisms introduce temporal dependencies between measurements and render naive estimators of individual program performance biased. Observing that rectifying such biases necessitates speculative assumptions about system dynamics, we call for prioritizing performance differentials over absolute measures and formalize software benchmarking as the decision problem of identifying the fastest program, for which relative knowledge suffices. To this end, we propose simple experiment designs admitting consistent estimators of contrasts, whereby program-specific biases cancel under tenable assumptions. These designs asymptotically yield the correct decision and afford a robust methodology for finite-budget benchmarking in stateful environments, bearing broad implications for the development of performance-sensitive software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reframes benchmarking as a decision problem of picking the fastest program and proposes experiment designs for consistent contrast estimation via bias cancellation in stateful systems.

read the letter

The main takeaway is that the paper reframes software benchmarking as a decision problem of finding the fastest program and proposes experiment designs to get consistent estimates of performance contrasts in stateful environments by canceling biases.

It does a good job highlighting how adaptive mechanisms introduce temporal dependencies that bias naive estimators, and it correctly notes that relative knowledge is enough for the identification task. The suggestion of simple designs that allow biases to cancel under reasonable assumptions is the novel framing here, and it could help with finite-budget experiments.

The soft spots are that the abstract only states the existence of such designs and their asymptotic correctness without providing derivations, proofs, or any validation. This makes it difficult to assess whether the estimators are truly consistent or if the assumptions hold in real systems. The reader's soundness score of 3.0 reflects this lack of detail accurately.

This paper is for researchers and practitioners in software performance evaluation who deal with complex, stateful systems. A reader looking for new statistical approaches to benchmarking might get value from it if the full paper supplies the missing technical content.

I think it deserves a serious referee to examine the proposed designs and check the claims.

Referee Report

0 major / 3 minor

Summary. The paper claims that stateful mechanisms in modern computing systems introduce temporal dependencies that bias naive estimators of individual program performance. It reframes software benchmarking as the decision problem of identifying the fastest program (rather than estimating absolute performance), and proposes simple experiment designs that admit consistent estimators of performance contrasts. Under tenable assumptions on system dynamics, program-specific biases cancel, so that the designs asymptotically yield the correct decision and support robust finite-budget benchmarking.

Significance. If the consistency results hold, the work supplies a statistically grounded methodology that avoids the need to model system dynamics explicitly while still guaranteeing asymptotically correct identification of the best program. This has clear practical value for performance evaluation in adaptive environments and shifts emphasis from absolute metrics to relative contrasts, which aligns with the decision-theoretic nature of benchmarking.

minor comments (3)

[Abstract] The abstract is dense and would benefit from a single concrete example of one of the proposed designs (e.g., the interleaving pattern or the contrast estimator formula) to make the central idea immediately accessible.
[§3] Notation for the contrast estimators and the bias-cancellation conditions should be introduced with a small running example in the main text before the general theorems; the current presentation jumps directly to the general case.
[§5] Figure 2 (or equivalent) showing the finite-sample behavior of the decision error rate would be strengthened by adding a baseline that uses naive independent sampling, to quantify the practical gain of the proposed designs.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary, significance assessment, and recommendation of minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained statistical argument

full rationale

The paper formalizes benchmarking as a decision problem of identifying the fastest program and proposes experiment designs whose consistent estimators of contrasts rely on program-specific biases canceling under stated assumptions about system dynamics. This is a direct application of standard statistical consistency results for relative estimators and does not reduce any claimed prediction or result to a fitted parameter, self-definition, or self-citation chain. No equations or steps in the provided abstract or claim description exhibit the enumerated circularity patterns; the argument remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of experiment designs whose contrast estimators cancel program-specific biases. No free parameters or invented entities are visible in the abstract. The key supporting premise is treated as a domain assumption rather than derived.

axioms (1)

domain assumption Program-specific biases cancel under tenable assumptions in the proposed experiment designs for contrasts.
Explicitly invoked in the abstract as the basis for consistent estimators and correct asymptotic decisions.

pith-pipeline@v0.9.1-grok · 5665 in / 1141 out tokens · 46834 ms · 2026-06-27T02:21:39.661970+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 1 linked inside Pith

[1]

Tracking down performance variation against source code evolution

Juan Pablo Sandoval Alcocer and Alexandre Bergel. Tracking down performance variation against source code evolution. ACM SIGPLAN Notices, 51 0 (2): 0 129--139, 2015

2015
[2]

Writing fast code, 2015

Andrei Alexandrescu. Writing fast code, 2015. URL https://www.youtube.com/watch?v=vrfYLlR8X8k&t=914s. Accessed: 2024-08-25

2015
[3]

Identification of causal effects using instrumental variables

Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91 0 (434): 0 444--455, 1996

1996
[4]

Virtual machine warmup blows hot and cold

Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. Virtual machine warmup blows hot and cold. Proceedings of the ACM on Programming Languages, 1 0 (OOPSLA): 0 1--27, 2017

2017
[5]

Conditional inference from confidence sets

George Casella. Conditional inference from confidence sets. Lecture Notes-Monograph Series, pages 1--12, 1992

1992
[6]

Robust benchmarking in noisy environments

Jiahao Chen and Jarrett Revels. Robust benchmarking in noisy environments. arXiv preprint arXiv:1608.04295, 2016

Pith/arXiv arXiv 2016
[7]

Statistical performance comparisons of computers

Tianshi Chen, Qi Guo, Olivier Temam, Yue Wu, Yungang Bao, Zhiwei Xu, and Yunji Chen. Statistical performance comparisons of computers. IEEE Transactions on Computers, 64 0 (5): 0 1442--1455, 2014

2014
[8]

Planning of experiments

David Roxbee Cox. Planning of experiments. Wiley, 1958

1958
[9]

Stabilizer: S tatistically sound performance evaluation

Charlie Curtsinger and Emery D Berger. Stabilizer: S tatistically sound performance evaluation. ACM SIGARCH Computer Architecture News, 41 0 (1): 0 219--228, 2013

2013
[10]

Paired benchmarking: H ow to measure performance, 2023

Denis Bazhenov . Paired benchmarking: H ow to measure performance, 2023. URL https://www.bazhenov.me/posts/paired-benchmarking/. Accessed: 2024-03-03

2023
[11]

An introduction to probability theory and its applications

William Feller. An introduction to probability theory and its applications. John Wiley & Sons, 3 edition, 1968. ISBN 978-0-471-25708-0

1968
[12]

Statistically rigorous J ava performance evaluation

Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous J ava performance evaluation. ACM SIGPLAN Notices, 42 0 (10): 0 57--76, 2007

2007
[13]

A microbenchmark case study and lessons learned

Joseph Yossi Gil, Keren Lenz, and Yuval Shimron. A microbenchmark case study and lessons learned. In Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE! 2011, AOOPES'11, NEAT'11, & VMIL'11, pages 297--308, 2011

2011
[14]

Longest interval between zeros of the tied-down random walk, the B rownian bridge and related renewal processes

Claude Godreche. Longest interval between zeros of the tied-down random walk, the B rownian bridge and related renewal processes. Journal of Physics A: Mathematical and Theoretical, 50 0 (19): 0 195003, 2017

2017
[15]

URL https://github.com/google/benchmark/

Google Benchmark l ibrary, 2020. URL https://github.com/google/benchmark/. Accessed: 2024-04-14

2020
[16]

Frequentist post-data inference

Constantinos Goutis and George Casella. Frequentist post-data inference. International Statistical Review/Revue Internationale de Statistique, pages 325--344, 1995

1995
[17]

Continuous benchmarking: U sing system benchmarking in build pipelines

Martin Grambow, Fabian Lehmann, and David Bermbach. Continuous benchmarking: U sing system benchmarking in build pipelines. In 2019 IEEE International Conference on Cloud Engineering (IC2E), pages 241--246. IEEE, 2019

2019
[18]

Benchmarking as empirical standard in software engineering research

Wilhelm Hasselbring. Benchmarking as empirical standard in software engineering research. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, pages 365--372, 2021

2021
[19]

An alternative to the kolmogrov-smirnov two-sample test

Walter Katzenbeisser and Peter Hackl. An alternative to the kolmogrov-smirnov two-sample test. Communications in Statistics-Theory and Methods, 15 0 (4): 0 1163--1177, 1986

1986
[20]

A note on the higher moments of the random variable t associated with the number of returns of a simple random walk

Walter Katzenbeisser and Wolfgang Panny. A note on the higher moments of the random variable t associated with the number of returns of a simple random walk. Advances in applied probability, 18 0 (1): 0 279--282, 1986

1986
[21]

The structure of scientific revolutions, volume 962

Thomas S Kuhn. The structure of scientific revolutions, volume 962. University of Chicago press Chicago, 1997

1997
[22]

War of the benchmark means: T ime for a truce

John R Mashey. War of the benchmark means: T ime for a truce. ACM SIGARCH Computer Architecture News, 32 0 (4): 0 1--14, 2004

2004
[23]

Producing wrong data without doing anything obviously wrong! ACM Sigplan Notices, 44 0 (3): 0 265--276, 2009

Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F Sweeney. Producing wrong data without doing anything obviously wrong! ACM Sigplan Notices, 44 0 (3): 0 265--276, 2009

2009
[24]

Theory of games and economic behavior

John von Neumann and Oskar Morgenstern. Theory of games and economic behavior. Princeton University Press, Princeton, NJ, USA, 1944

1944
[25]

Using benchmarking to advance research: A challenge to software engineering

Susan Elliott Sim, Steve Easterbrook, and Richard C Holt. Using benchmarking to advance research: A challenge to software engineering. In 25th International Conference on Software Engineering, 2003. Proceedings., pages 74--83. IEEE, 2003

2003
[26]

SPEC CPU 2006 B enchmarks, 2006

SPEC . SPEC CPU 2006 B enchmarks, 2006. URL http://www.spec.org/cpu2006/. Accessed: 2024-08-20

2006
[27]

Some pairwise independent sequences for which the central limit theorem fails

Janson Svante. Some pairwise independent sequences for which the central limit theorem fails. Stochastics: An International Journal of Probability and Stochastic Processes, 23 0 (4): 0 439--448, 1988

1988
[28]

Should computer scientists experiment more? Computer, 31 0 (5): 0 32--40, 1998

Walter F Tichy. Should computer scientists experiment more? Computer, 31 0 (5): 0 32--40, 1998

1998
[29]

Dynamic frequency scaling, 2024

Wikipedia . Dynamic frequency scaling, 2024. URL https://en.wikipedia.org/w/index.php?title=Dynamic_frequency_scaling&oldid=1215993085. Accessed: 2024-08-20

2024
[30]

Non-stationary A B tests

Yuhang Wu, Zeyu Zheng, Guangyu Zhang, Zuohua Zhang, and Chu Wang. Non-stationary A B tests. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2079--2089, 2022

2079

[1] [1]

Tracking down performance variation against source code evolution

Juan Pablo Sandoval Alcocer and Alexandre Bergel. Tracking down performance variation against source code evolution. ACM SIGPLAN Notices, 51 0 (2): 0 129--139, 2015

2015

[2] [2]

Writing fast code, 2015

Andrei Alexandrescu. Writing fast code, 2015. URL https://www.youtube.com/watch?v=vrfYLlR8X8k&t=914s. Accessed: 2024-08-25

2015

[3] [3]

Identification of causal effects using instrumental variables

Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91 0 (434): 0 444--455, 1996

1996

[4] [4]

Virtual machine warmup blows hot and cold

Edd Barrett, Carl Friedrich Bolz-Tereick, Rebecca Killick, Sarah Mount, and Laurence Tratt. Virtual machine warmup blows hot and cold. Proceedings of the ACM on Programming Languages, 1 0 (OOPSLA): 0 1--27, 2017

2017

[5] [5]

Conditional inference from confidence sets

George Casella. Conditional inference from confidence sets. Lecture Notes-Monograph Series, pages 1--12, 1992

1992

[6] [6]

Robust benchmarking in noisy environments

Jiahao Chen and Jarrett Revels. Robust benchmarking in noisy environments. arXiv preprint arXiv:1608.04295, 2016

Pith/arXiv arXiv 2016

[7] [7]

Statistical performance comparisons of computers

Tianshi Chen, Qi Guo, Olivier Temam, Yue Wu, Yungang Bao, Zhiwei Xu, and Yunji Chen. Statistical performance comparisons of computers. IEEE Transactions on Computers, 64 0 (5): 0 1442--1455, 2014

2014

[8] [8]

Planning of experiments

David Roxbee Cox. Planning of experiments. Wiley, 1958

1958

[9] [9]

Stabilizer: S tatistically sound performance evaluation

Charlie Curtsinger and Emery D Berger. Stabilizer: S tatistically sound performance evaluation. ACM SIGARCH Computer Architecture News, 41 0 (1): 0 219--228, 2013

2013

[10] [10]

Paired benchmarking: H ow to measure performance, 2023

Denis Bazhenov . Paired benchmarking: H ow to measure performance, 2023. URL https://www.bazhenov.me/posts/paired-benchmarking/. Accessed: 2024-03-03

2023

[11] [11]

An introduction to probability theory and its applications

William Feller. An introduction to probability theory and its applications. John Wiley & Sons, 3 edition, 1968. ISBN 978-0-471-25708-0

1968

[12] [12]

Statistically rigorous J ava performance evaluation

Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous J ava performance evaluation. ACM SIGPLAN Notices, 42 0 (10): 0 57--76, 2007

2007

[13] [13]

A microbenchmark case study and lessons learned

Joseph Yossi Gil, Keren Lenz, and Yuval Shimron. A microbenchmark case study and lessons learned. In Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE! 2011, AOOPES'11, NEAT'11, & VMIL'11, pages 297--308, 2011

2011

[14] [14]

Longest interval between zeros of the tied-down random walk, the B rownian bridge and related renewal processes

Claude Godreche. Longest interval between zeros of the tied-down random walk, the B rownian bridge and related renewal processes. Journal of Physics A: Mathematical and Theoretical, 50 0 (19): 0 195003, 2017

2017

[15] [15]

URL https://github.com/google/benchmark/

Google Benchmark l ibrary, 2020. URL https://github.com/google/benchmark/. Accessed: 2024-04-14

2020

[16] [16]

Frequentist post-data inference

Constantinos Goutis and George Casella. Frequentist post-data inference. International Statistical Review/Revue Internationale de Statistique, pages 325--344, 1995

1995

[17] [17]

Continuous benchmarking: U sing system benchmarking in build pipelines

Martin Grambow, Fabian Lehmann, and David Bermbach. Continuous benchmarking: U sing system benchmarking in build pipelines. In 2019 IEEE International Conference on Cloud Engineering (IC2E), pages 241--246. IEEE, 2019

2019

[18] [18]

Benchmarking as empirical standard in software engineering research

Wilhelm Hasselbring. Benchmarking as empirical standard in software engineering research. In Proceedings of the 25th International Conference on Evaluation and Assessment in Software Engineering, pages 365--372, 2021

2021

[19] [19]

An alternative to the kolmogrov-smirnov two-sample test

Walter Katzenbeisser and Peter Hackl. An alternative to the kolmogrov-smirnov two-sample test. Communications in Statistics-Theory and Methods, 15 0 (4): 0 1163--1177, 1986

1986

[20] [20]

A note on the higher moments of the random variable t associated with the number of returns of a simple random walk

Walter Katzenbeisser and Wolfgang Panny. A note on the higher moments of the random variable t associated with the number of returns of a simple random walk. Advances in applied probability, 18 0 (1): 0 279--282, 1986

1986

[21] [21]

The structure of scientific revolutions, volume 962

Thomas S Kuhn. The structure of scientific revolutions, volume 962. University of Chicago press Chicago, 1997

1997

[22] [22]

War of the benchmark means: T ime for a truce

John R Mashey. War of the benchmark means: T ime for a truce. ACM SIGARCH Computer Architecture News, 32 0 (4): 0 1--14, 2004

2004

[23] [23]

Producing wrong data without doing anything obviously wrong! ACM Sigplan Notices, 44 0 (3): 0 265--276, 2009

Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F Sweeney. Producing wrong data without doing anything obviously wrong! ACM Sigplan Notices, 44 0 (3): 0 265--276, 2009

2009

[24] [24]

Theory of games and economic behavior

John von Neumann and Oskar Morgenstern. Theory of games and economic behavior. Princeton University Press, Princeton, NJ, USA, 1944

1944

[25] [25]

Using benchmarking to advance research: A challenge to software engineering

Susan Elliott Sim, Steve Easterbrook, and Richard C Holt. Using benchmarking to advance research: A challenge to software engineering. In 25th International Conference on Software Engineering, 2003. Proceedings., pages 74--83. IEEE, 2003

2003

[26] [26]

SPEC CPU 2006 B enchmarks, 2006

SPEC . SPEC CPU 2006 B enchmarks, 2006. URL http://www.spec.org/cpu2006/. Accessed: 2024-08-20

2006

[27] [27]

Some pairwise independent sequences for which the central limit theorem fails

Janson Svante. Some pairwise independent sequences for which the central limit theorem fails. Stochastics: An International Journal of Probability and Stochastic Processes, 23 0 (4): 0 439--448, 1988

1988

[28] [28]

Should computer scientists experiment more? Computer, 31 0 (5): 0 32--40, 1998

Walter F Tichy. Should computer scientists experiment more? Computer, 31 0 (5): 0 32--40, 1998

1998

[29] [29]

Dynamic frequency scaling, 2024

Wikipedia . Dynamic frequency scaling, 2024. URL https://en.wikipedia.org/w/index.php?title=Dynamic_frequency_scaling&oldid=1215993085. Accessed: 2024-08-20

2024

[30] [30]

Non-stationary A B tests

Yuhang Wu, Zeyu Zheng, Guangyu Zhang, Zuohua Zhang, and Chu Wang. Non-stationary A B tests. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2079--2089, 2022

2079