When is Enough Enough? A Proposed Termination Point for the Number of Replicates in Computational Simulations

Eric T. Lofgren; Kellen Myers; Nina H. Fefferman

arxiv: 2606.10109 · v1 · pith:Q7WYLPHVnew · submitted 2026-06-08 · 🧬 q-bio.OT

Pith reviewed 2026-06-27 14:01 UTC · model grok-4.3

classification 🧬 q-bio.OT

keywords: computational simulation · replicate number · termination criterion · Ω test · frequentist statistics · in silico experiments

The pith

Simulations should stop adding replicates when the proposed Ω test signals sufficient stability, modeled on P-value logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that computational simulations can produce arbitrarily precise results simply by running more trials, creating ambiguity about what counts as enough data and wasting resources on extra runs. It proposes the Ω test as a uniform, objective stopping rule analogous to traditional frequentist P-tests. Adoption of this standard would let researchers terminate simulations at a theoretically grounded point rather than through ad-hoc choices. A reader would care because the method directly addresses efficiency and reproducibility in fields that rely on in silico experiments.
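The point about trial counts overwhelming frequentist statistics can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `simulate` is a hypothetical stand-in for a real model, the true effect of 0.05 standard deviations is an arbitrary choice, and the test is an ordinary large-sample z-test, not the paper's Ω test.

```python
import math
import random


def simulate(mean, n, rng):
    """Hypothetical stand-in for a simulation: n replicates of one noisy output."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]


def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = math.fsum(xs) / len(xs)
    v = math.fsum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def two_sided_p(a, b):
    """Large-sample z-test p-value for a difference in means (normal approximation)."""
    mean_a, var_a = mean_and_var(a)
    mean_b, var_b = mean_and_var(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(0)
p_values = {}
for n in (10, 1_000, 200_000):
    baseline = simulate(0.00, n, rng)
    scenario = simulate(0.05, n, rng)  # assumed tiny true effect: 0.05 standard deviations
    p_values[n] = two_sided_p(baseline, scenario)
    print(f"replicates per arm = {n:>7}: p = {p_values[n]:.3g}")
```

Because the standard error shrinks roughly as one over the square root of the replicate count, a fixed small difference eventually yields a very small p-value, whatever its practical relevance.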

Core claim

The authors claim that the Ω test provides a simple, straightforward criterion for halting additional simulation trials once results stabilize in a manner comparable to how P-tests indicate statistical significance, thereby replacing arbitrary decisions about replicate count with a consistent, communicable rule.

What carries the argument

The Ω test, a proposed termination criterion designed to function like a frequentist P-test by objectively signaling when further replicates are unnecessary.

If this is right

Simulation studies could adopt a shared stopping rule that permits direct comparison of results across papers.
Computational resources would be allocated only until the Ω criterion is met, reducing unnecessary runs.
Interpretation of simulation outcomes would become less dependent on the number of trials performed.
Reviewers could evaluate whether a study met an explicit, pre-specified termination standard.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fields that already use power analysis or convergence diagnostics might integrate the Ω test as an additional or alternative check.
If the test generalizes beyond the models tested in the paper, it could influence standards for reporting in agent-based modeling and similar domains.
The analogy to P-tests suggests the Ω test could be taught alongside classical statistics in methods courses for computational scientists.

Load-bearing premise

That an objective, uniform test for sufficient replicates can be defined without relying on arbitrary thresholds or post-hoc adjustments.

What would settle it

Running the Ω test on a set of established simulation models and checking whether it produces consistent stopping points that align with or contradict existing best-practice replicate counts across different model types.
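A check of this kind might be organized as in the sketch below. Since the paper does not define the Ω test, `stopping_point` uses a placeholder rule (stop once the approximate 95% interval half-width is within 5% of the running mean), and the three toy models are hypothetical stand-ins for established simulation models. Only the shape of the comparison is meant to carry over.

```python
import math
import random
import statistics


def stopping_point(draw, rel_tol=0.05, z=1.96, min_runs=30, max_runs=200_000):
    """Placeholder rule, NOT the Ω test: stop when z * SE <= rel_tol * |mean|."""
    n, mean, m2 = 0, 0.0, 0.0
    while n < max_runs:
        x = draw()
        n += 1
        delta = x - mean
        mean += delta / n          # Welford update of the running mean
        m2 += delta * (x - mean)   # running sum of squared deviations
        if n >= min_runs and z * math.sqrt(m2 / (n - 1) / n) <= rel_tol * abs(mean):
            break
    return n


# Hypothetical toy models, chosen only to differ in spread and skewness.
toy_models = {
    "low-variance": lambda rng: rng.gauss(10.0, 1.0),
    "high-variance": lambda rng: rng.gauss(10.0, 5.0),
    "skewed": lambda rng: rng.expovariate(0.1),
}

summary = {}
for name, model in toy_models.items():
    points = []
    for seed in range(20):
        rng = random.Random(seed)
        points.append(stopping_point(lambda: model(rng)))
    summary[name] = (min(points), statistics.median(points), max(points))
    print(f"{name:>13}: min/median/max replicates = {summary[name]}")
```

The spread of stopping points across seeds, and how the medians compare with replicate counts used in published studies of the same models, would be the quantities of interest.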

Original abstract

Computational simulation provides a powerful toolkit for in silico experimentation. However, while the field has developed best practices for the design and implementation of such models, there remains ambiguity in discussions about how to understand and/or interpret their results due to their inherent ability to overwhelm traditional frequentist statistics by simply increasing the number of trials simulated. This fails the discipline in two ways: first, it leaves the community unsure of what constitutes a best practice for uniform understanding, and second, it potentially overburdens computational studies that burn clock cycles solely to ensure "enough runs to satisfy peers" without any theoretical underpinning for a definition of "enough". We propose a simple and straightforward standard for when to stop simulating additional trials, the Ω test, designed to be analogous to the function of traditional frequentist P-tests. Community adoption of a reasonable and uniform standard will permit more efficient computational experimentation and clear communication/interpretation of the findings discovered in this way.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper names a real problem with simulation replicates but never defines or shows the proposed Ω test.

The manuscript identifies a practical issue: simulation studies can keep adding replicates without a clear stopping rule, and it suggests an Ω test meant to work like a p-value for deciding when to stop. That framing is reasonable on its face.

Beyond the framing, the paper supplies nothing else. There is no equation for Ω, no derivation, no algorithm, no worked example on real or toy data, and no comparison to existing rules of thumb or sequential sampling methods already used in Monte Carlo work. The abstract states the proposal; the rest of the manuscript appears to do the same without adding the test itself.
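For context, one sequential sampling rule of the kind commonly used in Monte Carlo work is the fixed-width confidence-interval procedure: keep adding replicates until the interval around the running mean is narrower than a chosen tolerance. The sketch below is a generic illustration and not the paper's Ω test. The names `run_until_precise` and `simulate_once`, the exponential toy model, and the tolerance of 0.05 are all assumptions.

```python
import math
import random


def run_until_precise(simulate_once, half_width, z=1.96, min_runs=30, max_runs=1_000_000):
    """Add replicates until the approximate 95% CI half-width of the mean is <= half_width."""
    n, mean, m2 = 0, 0.0, 0.0
    while n < max_runs:
        x = simulate_once()
        n += 1
        delta = x - mean
        mean += delta / n          # Welford update of the running mean
        m2 += delta * (x - mean)   # running sum of squared deviations
        if n >= min_runs:
            std_err = math.sqrt(m2 / (n - 1) / n)
            if z * std_err <= half_width:
                break
    return n, mean


rng = random.Random(1)
# Hypothetical toy model: one replicate is a single exponential draw with mean 1.
n_used, estimate = run_until_precise(lambda: rng.expovariate(1.0), half_width=0.05)
print(f"stopped after {n_used} replicates, estimated mean = {estimate:.3f}")
```

The tolerance `half_width` and the floor `min_runs` are themselves analyst choices, which is exactly the kind of threshold a proposed alternative would need to be compared against.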

The central claim therefore rests on an assertion rather than evidence. Without the actual test, it is impossible to check whether it avoids arbitrary thresholds, reduces to post-hoc fitting, or improves on simpler convergence diagnostics that practitioners already apply.

This leaves the work in a preliminary state. It might interest methodologists who want to standardize simulation practice, but it does not yet give them a usable tool or enough detail to evaluate one. A reader looking for a concrete stopping criterion will find the paper empty on that point.

I would not send it to referees. The authors would need to supply the actual test, its statistical justification, and at least one validation case before the manuscript becomes evaluable.

Referee Report

1 major / 0 minor

Summary. The manuscript identifies ambiguity in determining sufficient replicates for computational simulations, noting that unlimited trials can overwhelm frequentist statistics and lead to inefficient resource use without a theoretical basis for 'enough'. It proposes the Ω test as a simple, uniform standard analogous to p-tests for deciding when to terminate additional simulations, enabling better best practices and clearer interpretation of results.

Significance. A well-defined, non-arbitrary stopping rule for simulation replicates could standardize practices and improve efficiency if it were theoretically grounded and validated. The manuscript, however, supplies no such rule, so the work primarily restates a known issue without advancing a solution.

major comments (1)

[Abstract] The central claim introduces the Ω test as 'a simple and straightforward standard' analogous to p-tests, yet provides no equation, algorithm, convergence criterion, derivation, or example. This absence makes it impossible to evaluate whether the test avoids arbitrary thresholds or rests on identifiable statistical principles.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the sole major comment below.

Referee: [Abstract] The central claim introduces the Ω test as 'a simple and straightforward standard' analogous to p-tests, yet provides no equation, algorithm, convergence criterion, derivation, or example. This absence makes it impossible to evaluate whether the test avoids arbitrary thresholds or rests on identifiable statistical principles.

Authors: We agree with the referee that the abstract (and, by extension, the current manuscript) does not supply the equation, algorithm, convergence criterion, derivation, or example for the Ω test. This omission prevents evaluation of its statistical grounding. In the revised manuscript we will add a dedicated methods section that defines the Ω statistic mathematically, specifies the termination algorithm and convergence criterion, derives the test from first principles, and includes a concrete numerical example. These additions will allow direct assessment of whether the rule is non-arbitrary and rests on identifiable principles.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain present; proposal asserted without equations or reductions

The paper proposes the Ω test as an analogous stopping rule to p-values for simulation replicates but supplies neither equations, algorithms, convergence criteria, nor any derivation from first principles, data, or prior results. With no claimed mathematical chain or fitted parameters to inspect, none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can be exhibited. The central claim is a normative proposal rather than a derived result, rendering the manuscript self-contained against the circularity criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no free parameters, axioms, or invented entities are specified beyond the named test itself.

pith-pipeline@v0.9.1-grok · 5702 in / 1096 out tokens · 31379 ms · 2026-06-27T14:01:42.898513+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references

[1] S.M. Ross, Simulation, Academic Press, 2022

[2] A.-L. Boulesteix, R.H. Groenwold, M. Abrahamowicz, H. Binder, M. Briel, R. Hornung, T.P. Morris, J. Rahnenführer, W. Sauerbrei, Introduction to statistical simulations in health research, BMJ open, 10 (2020) e039921

[3] R. Alizadeh, J.K. Allen, F. Mistree, Managing computational complexity using surrogate models: a critical review, Research in Engineering Design, 31 (2020) 275-298

[4] R. Seri, D. Secchi, How many times should one run a computational simulation?, Simulating social complexity: A handbook, (2017) 229-251

[5] E.T. Lofgren, Visualizing Results From Infection Transmission Models: A Case Against “Confidence Intervals”, Epidemiology, 23 (2012) 738-741

[6] L.G. Halsey, The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum?, Biology letters, 15 (2019) 20190174

[7] L.G. Halsey, D. Curran-Everett, S.L. Vowler, G.B. Drummond, The fickle P value generates irreproducible results, Nature methods, 12 (2015) 179-185

[8] S. Greenland, S.J. Senn, K.J. Rothman, J.B. Carlin, C. Poole, S.N. Goodman, D.G. Altman, Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations, European journal of epidemiology, 31 (2016) 337-350

[9] S. Nakagawa, T.M. Foster, The case against retrospective statistical power analyses with an introduction to power analysis, Acta ethologica, 7 (2004) 103-108

[10] J.M. Hoenig, D.M. Heisey, The abuse of power: the pervasive fallacy of power calculations for data analysis, The American Statistician, 55 (2001) 19-24

[11] S.E. Maxwell, K. Kelley, J.R. Rausch, Sample size planning for statistical power and accuracy in parameter estimation, Annu. Rev. Psychol., 59 (2008) 537-563

[12] D.R. Cox, P.J. Solomon, Components of variance, Chapman and Hall/CRC, 2002

[13] M.L. Head, L. Holman, R. Lanfear, A.T. Kahn, M.D. Jennions, The extent and consequences of p-hacking in science, PLoS biology, 13 (2015) e1002106

[14] S. Greenland, Modeling and variable selection in epidemiologic analysis, American journal of public health, 79 (1989) 340-349

[15] D. Westreich, S. Greenland, The table 2 fallacy: presenting and interpreting confounder and modifier coefficients, American journal of epidemiology, 177 (2013) 292-298

[16] J.W. White, A. Rassweiler, J.F. Samhouri, A.C. Stier, C. White, Ecologists should not use statistical significance tests to interpret simulation model results, Oikos, 123 (2014) 385-388
