When is Enough Enough? A Proposed Termination Point for the Number of Replicates in Computational Simulations
Pith reviewed 2026-06-27 14:01 UTC · model grok-4.3
The pith
Simulations should stop adding replicates when the proposed Ω test signals sufficient stability, modeled on P-value logic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that the Ω test provides a simple, straightforward criterion for halting additional simulation trials once results stabilize, comparable to how P-tests indicate statistical significance. The rule would replace arbitrary decisions about replicate count with a consistent, communicable standard.
What carries the argument
The Ω test, a proposed termination criterion designed to function like a frequentist P-test by objectively signaling when further replicates are unnecessary.
If this is right
- Simulation studies could adopt a shared stopping rule that permits direct comparison of results across papers.
- Computational resources would be allocated only until the Ω criterion is met, reducing unnecessary runs.
- Interpretation of simulation outcomes would become less dependent on the number of trials performed.
- Reviewers could evaluate whether a study met an explicit, pre-specified termination standard.
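As a rough illustration of what stopping at a pre-specified criterion could look like in practice, the sketch below adds replicates one at a time until a stand-in rule is satisfied. The abstract does not define the Ω statistic, so the rule used here (the relative half-width of a normal-approximation 95% confidence interval for the running mean) is an assumption chosen purely for illustration, as are the names `run_until_stable`, `rel_tol`, `min_reps`, and `max_reps` and the toy model.

```python
import math
import random
import statistics


def run_until_stable(simulate, rel_tol=0.05, min_reps=10, max_reps=10_000):
    """Add replicates until a stand-in stability criterion is met.

    NOTE: this is NOT the Ω test (the source gives no formula for it).
    The stand-in rule stops when the half-width of an approximate 95%
    confidence interval for the mean is at most rel_tol * |mean|.
    """
    results = []
    while len(results) < max_reps:
        results.append(simulate())
        n = len(results)
        if n < min_reps:
            continue  # too few replicates to estimate the spread
        mean = statistics.fmean(results)
        half_width = 1.96 * statistics.stdev(results) / math.sqrt(n)
        if half_width <= rel_tol * abs(mean):
            return n, mean, True  # criterion met
    return len(results), statistics.fmean(results), False  # budget exhausted


# Hypothetical toy model: one replicate returns a noisy outcome around 100.
rng = random.Random(42)
n_used, estimate, met = run_until_stable(lambda: rng.gauss(100.0, 15.0))
print(f"stopped after {n_used} replicates, mean={estimate:.2f}, met={met}")
```

The stopping rule is hard-coded in this sketch for brevity; any explicit criterion, including the Ω test once it is specified, could take its place, and the run reports both the replicate count and whether the criterion was actually met.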
Where Pith is reading between the lines
- Fields that already use power analysis or convergence diagnostics might integrate the Ω test as an additional or alternative check.
- If the test generalizes beyond the models tested in the paper, it could influence standards for reporting in agent-based modeling and similar domains.
- The analogy to P-tests suggests the Ω test could be taught alongside classical statistics in methods courses for computational scientists.
Load-bearing premise
That an objective, uniform test for sufficient replicates can be defined without relying on arbitrary thresholds or post-hoc adjustments.
What would settle it
Running the Ω test on a set of established simulation models and checking whether it produces consistent stopping points that align with or contradict existing best-practice replicate counts across different model types.
Original abstract
Computational simulation provides a powerful toolkit for in silico experimentation. However, while the field has developed best practices for the design and implementation of such models, there remains ambiguity in discussions about how to understand and/or interpret their results due to their inherent ability to overwhelm traditional frequentist statistics by simply increasing the number of trials simulated. This fails the discipline in two ways: first, it leaves the community unsure of what constitutes a best practice for uniform understanding, and second, it potentially overburdens computational studies that burn clock cycles solely to ensure "enough runs to satisfy peers" without any theoretical underpinning for a definition of "enough". We propose a simple and straightforward standard for when to stop simulating additional trials, the Ω test, designed to be analogous to the function of traditional frequentist P-tests. Community adoption of a reasonable and uniform standard will permit more efficient computational experimentation and clearer communication and interpretation of the findings discovered in this way.
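The abstract's point that simulations can "overwhelm traditional frequentist statistics by simply increasing the number of trials simulated" can be illustrated with a short calculation. The sketch below assumes a two-sample z-test with known, equal standard deviations and a fixed, practically negligible true difference between two scenarios. The effect size, standard deviation, and function name are hypothetical choices, and the quantity computed is the p-value at the expected z statistic, not the result of any particular simulated run.

```python
import math


def p_value_at_expected_z(delta, sigma, n_per_arm):
    """Two-sided p-value of a two-sample z-test evaluated at the expected
    z statistic, for true mean difference delta, common standard deviation
    sigma, and n_per_arm replicates in each arm."""
    standard_error = sigma * math.sqrt(2.0 / n_per_arm)
    z = abs(delta) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical scenario: a 0.1-unit difference against a spread of 10 units.
for n in (100, 10_000, 1_000_000):
    print(n, p_value_at_expected_z(delta=0.1, sigma=10.0, n_per_arm=n))
```

Under these assumptions the p-value falls monotonically as replicates are added, so nominal significance becomes a function of the compute budget instead of the size of the effect.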
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies ambiguity in determining sufficient replicates for computational simulations, noting that unlimited trials can overwhelm frequentist statistics and lead to inefficient resource use without a theoretical basis for 'enough'. It proposes the Ω test as a simple, uniform standard analogous to p-tests for deciding when to terminate additional simulations, enabling better best practices and clearer interpretation of results.
Significance. A well-defined, non-arbitrary stopping rule for simulation replicates could standardize practices and improve efficiency if it were theoretically grounded and validated. The manuscript, however, supplies no such rule, so the work primarily restates a known issue without advancing a solution.
major comments (1)
- [Abstract] The central claim introduces the Ω test as 'a simple and straightforward standard' analogous to p-tests, yet provides no equation, algorithm, convergence criterion, derivation, or example. This absence makes it impossible to evaluate whether the test avoids arbitrary thresholds or rests on identifiable statistical principles.
Simulated Author's Rebuttal
We thank the referee for their review. We address the sole major comment below.
- Referee: [Abstract] The central claim introduces the Ω test as 'a simple and straightforward standard' analogous to p-tests, yet provides no equation, algorithm, convergence criterion, derivation, or example. This absence makes it impossible to evaluate whether the test avoids arbitrary thresholds or rests on identifiable statistical principles.
  Authors: We agree with the referee that the abstract (and, by extension, the current manuscript) does not supply the equation, algorithm, convergence criterion, derivation, or example for the Ω test. This omission prevents evaluation of its statistical grounding. In the revised manuscript we will add a dedicated methods section that defines the Ω statistic mathematically, specifies the termination algorithm and convergence criterion, derives the test from first principles, and includes a concrete numerical example. These additions will allow direct assessment of whether the rule is non-arbitrary and rests on identifiable principles.
  Revision: yes
Circularity Check
No derivation chain present; proposal asserted without equations or reductions
The paper proposes the Ω test as an analogous stopping rule to p-values for simulation replicates but supplies neither equations, algorithms, convergence criteria, nor any derivation from first principles, data, or prior results. With no claimed mathematical chain or fitted parameters to inspect, none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can be exhibited. The central claim is a normative proposal rather than a derived result, rendering the manuscript self-contained against the circularity criteria.
Reference graph
Works this paper leans on
- [1] S.M. Ross, Simulation, Academic Press, 2022
- [2] A.-L. Boulesteix, R.H. Groenwold, M. Abrahamowicz, H. Binder, M. Briel, R. Hornung, T.P. Morris, J. Rahnenführer, W. Sauerbrei, Introduction to statistical simulations in health research, BMJ open, 10 (2020) e039921
- [3] R. Alizadeh, J.K. Allen, F. Mistree, Managing computational complexity using surrogate models: a critical review, Research in Engineering Design, 31 (2020) 275-298
- [4] R. Seri, D. Secchi, How many times should one run a computational simulation?, Simulating social complexity: A handbook, (2017) 229-251
- [5] E.T. Lofgren, Visualizing Results From Infection Transmission Models: A Case Against “Confidence Intervals”, Epidemiology, 23 (2012) 738-741
- [6] L.G. Halsey, The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum?, Biology letters, 15 (2019) 20190174
- [7] L.G. Halsey, D. Curran-Everett, S.L. Vowler, G.B. Drummond, The fickle P value generates irreproducible results, Nature methods, 12 (2015) 179-185
- [8] S. Greenland, S.J. Senn, K.J. Rothman, J.B. Carlin, C. Poole, S.N. Goodman, D.G. Altman, Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations, European journal of epidemiology, 31 (2016) 337-350
- [9] S. Nakagawa, T.M. Foster, The case against retrospective statistical power analyses with an introduction to power analysis, Acta ethologica, 7 (2004) 103-108
- [10] J.M. Hoenig, D.M. Heisey, The abuse of power: the pervasive fallacy of power calculations for data analysis, The American Statistician, 55 (2001) 19-24
- [11] S.E. Maxwell, K. Kelley, J.R. Rausch, Sample size planning for statistical power and accuracy in parameter estimation, Annu. Rev. Psychol., 59 (2008) 537-563
- [12] D.R. Cox, P.J. Solomon, Components of variance, Chapman and Hall/CRC, 2002
- [13] M.L. Head, L. Holman, R. Lanfear, A.T. Kahn, M.D. Jennions, The extent and consequences of p-hacking in science, PLoS biology, 13 (2015) e1002106
- [14] S. Greenland, Modeling and variable selection in epidemiologic analysis, American journal of public health, 79 (1989) 340-349
- [15] D. Westreich, S. Greenland, The table 2 fallacy: presenting and interpreting confounder and modifier coefficients, American journal of epidemiology, 177 (2013) 292-298
- [16] J.W. White, A. Rassweiler, J.F. Samhouri, A.C. Stier, C. White, Ecologists should not use statistical significance tests to interpret simulation model results, Oikos, 123 (2014) 385-388