pith. machine review for the scientific record.

arxiv: 2604.25458 · v1 · submitted 2026-04-28 · 💻 cs.NE

Recognition: unknown

Benchmarking Stopping Criteria for Evolutionary Multi-objective Optimization

Kenji Kitamura, Ryoji Tanabe

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 14:03 UTC · model grok-4.3

classification 💻 cs.NE
keywords stopping criteria · evolutionary multi-objective optimization · benchmarking · performance measure · file-based approach · population storage · convergence detection · EMO algorithms

The pith

A scalar performance measure and file-based method for storing population states enable fair, reproducible benchmarking of stopping criteria in evolutionary multi-objective optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the lack of effective ways to benchmark stopping criteria in evolutionary multi-objective optimization, a gap that has stalled the development of new criteria. It introduces a performance measure that condenses a criterion's effectiveness into a single number for direct comparison. It also supplies a file-based benchmarking process that records population states in text files for simplicity and reproducibility, plus a compact data format to keep those files small. If these tools succeed, researchers can rigorously evaluate rules for when an algorithm should halt, instead of wasting evaluations on a stagnant population. This matters for practical applications where each function evaluation costs time or money.

Core claim

The paper claims that its proposed performance measure represents stopping-criterion quality as a single scalar value for easy comparison, that the file-based benchmarking approach simplifies experiments while supporting reproducibility, and that the accompanying data representation method solves the file-size problem in that approach, with effectiveness shown by applying the tools to five representative stopping criteria for EMO.

What carries the argument

The scalar performance measure for stopping criteria paired with the file-based benchmarking approach that stores population states in compact text files.

If this is right

  • Direct numerical comparisons of different stopping criteria become possible without subjective judgment.
  • Reproducibility improves because benchmark runs can be shared and re-evaluated from the stored population files.
  • New stopping criteria can be validated more quickly against existing ones using the same standardized process.
  • EMO applications in practice can adopt criteria shown to stop at appropriate times, reducing wasted evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same scalar-plus-file structure could be adapted to benchmark early-stopping rules in single-objective evolutionary algorithms or in machine-learning training.
  • Public archives of population-state files might emerge, allowing community-wide re-use and meta-analysis of stopping performance.
  • The method opens the door to studying how stopping-criterion effectiveness changes with problem features such as objective count or decision-space dimensionality.

Load-bearing premise

A single scalar performance measure can meaningfully capture the quality of stopping criteria across varied EMO problems, and saving population states to text files preserves all the information needed for accurate benchmarking.

What would settle it

Running the scalar measure on a common set of problems and stopping criteria and checking whether its rankings contradict those produced by multiple independent metrics or by expert review of the actual convergence behavior.

Figures

Figures reproduced from arXiv: 2604.25458 by Kenji Kitamura, Ryoji Tanabe.

Figure 1. HV values in a single run of NSGA-II. The source caption also carries the pseudocode of Algorithm 1, "An EMO algorithm with a stopping criterion": t ← 1, b_stop ← False; initialize the population P(t) = {x(t)_1, …, x(t)_μ}; while b_stop = False and t < t_max do: P'(t) ← Mating selection(P(t)); Q(t) ← Variation(P'(t)); P(t+1) ← Environmental selection(P(t) ∪ Q(t)); b_stop ← Stopping criterion(P(t+1), P(t), … view at source ↗
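A minimal Python sketch of Algorithm 1's control flow as recovered from the caption above. Every operator, including the stopping criterion, is a caller-supplied callable; the names and list-based population type are illustrative assumptions, not the paper's implementation.

```python
def emo_loop(initialize, mating_selection, variation,
             environmental_selection, stopping_criterion, t_max):
    """Skeleton of Algorithm 1: iterate until the stopping criterion
    fires or the iteration budget t_max is exhausted."""
    t = 1
    population = initialize()        # P(1) = {x_1, ..., x_mu}
    history = [population]           # recorded population states
    b_stop = False
    while not b_stop and t < t_max:
        parents = mating_selection(population)              # P'(t)
        offspring = variation(parents)                      # Q(t)
        population = environmental_selection(population + offspring)  # P(t+1)
        history.append(population)
        b_stop = stopping_criterion(history)  # may inspect all states so far
        t += 1
    return population, t, history
```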
Figure 2. Examples of three text files that maintain the three … view at source ↗
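Figure 2's caption is truncated, so the exact layout of the three files is not recoverable here. As a hedged sketch of the file-based idea, a recorder might append one line per solution per iteration; the file name, field order, and precision below are assumptions, not the paper's format.

```python
from pathlib import Path

def record_population_state(run_dir, t, objective_vectors):
    """Append the objective vectors f(P(t)) of iteration t to a text file.

    One line per solution: the iteration index followed by the m objective
    values. Layout and precision are illustrative assumptions.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    with (run_dir / "objectives.txt").open("a") as fh:
        for f_x in objective_vectors:
            fh.write(f"{t} " + " ".join(f"{v:.6e}" for v in f_x) + "\n")
```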
Figure 3. Examples of the proposed method. Let us consider traditional benchmarking of a stopping criterion (e.g., OCD) by incorporating it into an EMO algorithm (e.g., NSGA-II). Based on a sequence of population states f(P(1)), …, f(P(t)), the stopping criterion determines whether to stop the search at iteration t. We point out that this process can be perfectly simulated by sequentially reloading the above-… view at source ↗
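The replay step that the Figure 3 caption describes can be sketched directly: feed growing prefixes of the stored sequence f(P(1)), …, f(P(t)) to a criterion until it fires, with no optimizer run needed. The return convention and `evals_per_iteration` parameter are illustrative assumptions.

```python
def replay_stopping_criterion(states, stopping_criterion, evals_per_iteration):
    """Simulate a stopping criterion offline from recorded states.

    `states` is the stored sequence f(P(1)), ..., f(P(T)) of one run.
    Returns FE_stop, the evaluations consumed when the criterion first
    fires, or the full budget if it never fires.
    """
    for t in range(1, len(states) + 1):
        if stopping_criterion(states[:t]):   # criterion sees states up to t
            return t * evals_per_iteration
    return len(states) * evals_per_iteration
```

Because every criterion replays the same stored runs, comparisons between criteria are paired and exactly reproducible.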
Figure 5. Distributions of the average POSE values. view at source ↗
Figure 6. FE* and FE_stop of the five stopping criteria on the bi-objective DTLZ1 problem in a single run. The gray line in … view at source ↗
Figure 7. Distributions of POSE values for m ∈ {2, 4, 6} when using NSGA-II. view at source ↗
Figure 8. Total size of all text files in the two data representations. view at source ↗
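Figure 8 compares file sizes under two data representations, and the reference list includes the base64 data encodings spec [20], which suggests, without confirming, that the compact representation packs binary floats and base64-encodes them. The sketch below contrasts that guess with plain decimal text.

```python
import base64
import struct

def plain_text_line(values):
    """Readable representation: full-precision decimal text."""
    return " ".join(f"{v:.17g}" for v in values)

def packed_base64_line(values):
    """Compact representation (assumed): IEEE-754 doubles, base64-encoded.

    8 bytes per value before encoding, ~10.7 characters after, versus up
    to ~25 characters per value for round-trippable decimal text.
    """
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")

vec = [0.12345678901234567, 1e-9, 3.5]
print(len(plain_text_line(vec)), len(packed_base64_line(vec)))
```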
read the original abstract

Stopping criteria automatically determine when to stop an evolutionary algorithm, so as not to waste function evaluations on a stagnant population. Although stopping criteria play an important role in real-world applications, they have attracted little attention in the evolutionary multi-objective optimization (EMO) community. In fact, new stopping criteria for EMO have been rarely developed in recent years. One reason for the stagnation in developing stopping criteria for EMO is a lack of effective benchmarking methodologies. To address this issue, this paper proposes (i) a performance measure of stopping criteria for EMO and (ii) a file-based benchmarking approach. This paper also proposes (iii) a data representation method that effectively stores population states in text files. (i) The proposed measure represents the performance of stopping criteria as a single scalar value, making comparison easy. (ii) The proposed file-based approach not only simplifies the benchmarking process but also facilitates reproducibility. (iii) The proposed data representation method addresses the issue of file size in (ii). We demonstrate the effectiveness of our three contributions (i)--(iii) by benchmarking five representative stopping criteria for EMO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes three contributions to address the lack of benchmarking methodologies for stopping criteria in evolutionary multi-objective optimization (EMO): (i) a performance measure represented as a single scalar value to enable easy comparison of stopping criteria, (ii) a file-based benchmarking approach to simplify the process and improve reproducibility, and (iii) a data representation method for efficiently storing population states in text files to mitigate file size issues. These are demonstrated through benchmarking of five representative stopping criteria for EMO.

Significance. If the single-scalar performance measure can be shown to aggregate proximity to the Pareto front, evaluation savings, and cross-problem robustness in a non-arbitrary, ranking-preserving manner, and if the file-based approach delivers lossless, reproducible comparisons, the work would provide a much-needed standardized framework for evaluating and developing stopping criteria in EMO, an area that has seen little recent progress. The explicit focus on reproducibility via file-based methods is a clear strength.

major comments (1)
  1. Abstract and contribution (i): the central claim that a single scalar performance measure enables meaningful comparisons rests on an unspecified aggregation of closeness to the true Pareto front, number of wasted evaluations avoided, and robustness across problem classes. Without an explicit formula, weighting scheme, or demonstration that the scalar preserves rankings under changes in objective scaling, front shape, or noise, the measure risks oversimplifying heterogeneous EMO landscapes and the subsequent benchmarking demonstration loses force.
minor comments (1)
  1. Abstract: the description of contributions (i)–(iii) remains at a high level with no formulas for the performance measure, no pseudocode or specification for the file format or data representation, and no quantitative results from the five-criteria benchmark, which makes it difficult to assess the claimed effectiveness.
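To make both comments concrete: an explicit scalar would need to look something like the following hypothetical aggregation, written with the FE* (ideal stopping point) and FE_stop (evaluations spent when the criterion fires) notation visible in Figure 6. This is an illustration of the referee's ask, not the paper's actual POSE measure.

```latex
% Hypothetical aggregation, for illustration only: average, over R runs,
% the budget-normalized gap between where the criterion stopped (FE_stop)
% and where it ideally should have stopped (FE*).
\mathrm{score} = \frac{1}{R} \sum_{r=1}^{R}
  \frac{\left| \mathrm{FE}_{\mathrm{stop}}^{(r)} - \mathrm{FE}^{*(r)} \right|}{\mathrm{FE}_{\max}}
```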

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation for major revision. We address the single major comment below and describe the changes we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [—] Abstract and contribution (i): the central claim that a single scalar performance measure enables meaningful comparisons rests on an unspecified aggregation of closeness to the true Pareto front, number of wasted evaluations avoided, and robustness across problem classes. Without an explicit formula, weighting scheme, or demonstration that the scalar preserves rankings under changes in objective scaling, front shape, or noise, the measure risks oversimplifying heterogeneous EMO landscapes and the subsequent benchmarking demonstration loses force.

    Authors: We agree that the aggregation underlying the single-scalar performance measure must be made fully explicit. In the submitted manuscript the measure is described at a high level as combining proximity to the Pareto front, evaluation savings, and cross-problem robustness, but the precise formula and weighting scheme were not stated in the abstract or early sections. In the revised manuscript we will insert a new subsection (Section 3.1) that gives the mathematical definition of the scalar, specifies the weighting coefficients (with justification), and reports additional experiments that test ranking stability under objective scaling, different front geometries, and additive noise. These additions will also include a short discussion of the measure’s limitations with respect to heterogeneous landscapes. We believe the expanded presentation will remove any ambiguity and reinforce the validity of the subsequent benchmarking results. revision: yes
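The ranking-stability experiments promised here are straightforward to set up. A minimal sketch, assuming per-criterion scores have already been computed on the original and rescaled objectives (the score values below are made up), uses Kendall's tau to compare the two rankings:

```python
from scipy.stats import kendalltau

def ranking_stability(scores_original, scores_rescaled):
    """Kendall's tau between two score vectors over the same criteria.

    tau = 1 means the scalar measure ranks the stopping criteria
    identically before and after the objective rescaling.
    """
    tau, _ = kendalltau(scores_original, scores_rescaled)
    return tau

# Hypothetical scores for five stopping criteria (lower = better),
# before and after rescaling one objective by a constant factor.
before = [0.12, 0.30, 0.25, 0.08, 0.41]
after = [0.15, 0.28, 0.27, 0.09, 0.44]
print(ranking_stability(before, after))  # 1.0: ranking preserved
```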

Circularity Check

0 steps flagged

No circularity: independent methodological proposals with no self-referential derivations

full rationale

The paper introduces three standalone contributions—a scalar performance measure for EMO stopping criteria, a file-based benchmarking framework, and a compact text-file data representation—without any equations, fitted parameters, or derivations that reduce to prior inputs. Benchmarking of five existing criteria is presented as an empirical demonstration rather than a self-referential prediction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core methods. The work is self-contained as a set of practical tools; the single-scalar measure is explicitly proposed as a new construct, not derived from or fitted to the benchmarking results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is methodological and rests on standard domain assumptions of EMO rather than new fitted parameters or invented entities.

axioms (1)
  • domain assumption Evolutionary multi-objective optimization algorithms maintain populations and use Pareto dominance or similar ranking to guide search.
    The benchmarking proposals operate inside the conventional EMO framework without re-deriving or questioning these background assumptions.

pith-pipeline@v0.9.0 · 5490 in / 1172 out tokens · 91889 ms · 2026-05-07T14:03:31.151856+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 21 canonical work pages

  1. [1] Anne Auger, Johannes Bader, Dimo Brockhoff, and Eckart Zitzler. 2012. Hypervolume-based multiobjective optimization: Theoretical foundations and practical implications. Theor. Comput. Sci. 425 (2012), 75–103. doi:10.1016/J.TCS.2011.03.012
  2. [2] Nicola Beume, Boris Naujoks, and Michael Emmerich. 2007. SMS-EMOA: Multiobjective selection based on dominated hypervolume. European Journal of Operational Research 181, 3 (2007), 1653–1669.
  3. [3] Mauro Birattari. 2009. Tuning Metaheuristics - A Machine Learning Perspective. Studies in Computational Intelligence, Vol. 197. Springer. doi:10.1007/978-3-642-00483-4
  4. [4] Francesco Biscani and Dario Izzo. 2020. A parallel global multiobjective framework for optimization: pagmo. Journal of Open Source Software 5, 53 (2020), 2338.
  5. [5] Julian Blank and Kalyanmoy Deb. 2020. Pymoo: Multi-Objective Optimization in Python. IEEE Access 8 (2020), 89497–89509. doi:10.1109/ACCESS.2020.2990567
  6. [6] Dimo Brockhoff. 2015. A Bug in the Multiobjective Optimizer IBEA: Salutary Lessons for Code Release and a Performance Re-Assessment. In Evolutionary Multi-Criterion Optimization - 8th International Conference, EMO 2015, Guimarães, Portugal, March 29 - April 1, 2015, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 9018), António Gaspar-Cunha, Ca…
  7. [7] Carlos A. Coello Coello and Margarita Reyes Sierra. 2004. A Study of the Parallelization of a Coevolutionary Multi-objective Evolutionary Algorithm. In MICAI. 688–697. doi:10.1007/978-3-540-24694-7_71
  8. [8] Kalyanmoy Deb. 2001. Multi-objective optimization using evolutionary algorithms. Wiley.
  9. [9] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. 2002. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 2 (2002), 182–197. doi:10.1109/4235.996017
  10. [10] Kalyanmoy Deb and Himanshu Jain. 2014. An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems With Box Constraints. IEEE Trans. Evol. Comput. 18, 4 (2014), 577–601. doi:10.1109/TEVC.2013.2281535
  11. [11] Kalyanmoy Deb, Lothar Thiele, Marco Laumanns, and Eckart Zitzler. 2005. Scalable Test Problems for Evolutionary Multiobjective Optimization. In Evolutionary Multiobjective Optimization. Springer, 105–145. doi:10.1007/1-84628-137-7_6
  12. [12] Iyad Abu Doush, Mohammed El-Abd, Abdelaziz I. Hammouri, and Mohammad Qasem Bataineh. 2023. The effect of different stopping criteria on multi-objective optimization algorithms. Neural Comput. Appl. 35, 2 (2023), 1125–1155. doi:10.1007/S00521-021-05805-1
  13. [13] Salvador García, Alberto Fernández, Julián Luengo, and Francisco Herrera. 2010. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 180, 10 (2010), 2044–2064. doi:10.1016/J.INS.2009.12.010
  14. [14] Tushar Goel and Nielen Stander. 2010. A non-dominance-based online stopping criterion for multi-objective evolutionary algorithms. Internat. J. Numer. Methods Engrg. 84, 6 (2010), 661–684. doi:10.1002/nme.2909
  15. [15] Cheng Gong, Yang Nan, Lie Meng Pang, Hisao Ishibuchi, and Qingfu Zhang.
  16. [16] Performance of NSGA-III on Multi-objective Combinatorial Optimization Problems Heavily Depends on Its Implementations. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2024, Melbourne, VIC, Australia, July 14-18, 2024, Xiaodong Li and Julia Handl (Eds.). ACM. doi:10.1145/3638529.3654004
  17. [17] David Hadka and Patrick M. Reed. 2013. Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evol. Comput. 21, 2 (2013), 231–259. doi:10.1162/EVCO_A_00075
  18. [18] Michael Pilegaard Hansen and Andrzej Jaszkiewicz. 1998. Evaluating the quality of approximations to the non-dominated set. Technical Report IMM-REP-1998-7. Poznan University of Technology.
  19. [19] Nikolaus Hansen. 2016. The CMA Evolution Strategy: A Tutorial. CoRR abs/1604.00772 (2016). arXiv:1604.00772 http://arxiv.org/abs/1604.00772
  20. [20] Simon Josefsson. 2006. The base16, base32, and base64 data encodings. Technical Report.
  21. [21] Manuel López-Ibáñez, Jürgen Branke, and Luís Paquete. 2021. Reproducibility in Evolutionary Computation. ACM Trans. Evol. Learn. Optim. 1, 4 (2021), 14:1–14:21. doi:10.1145/3466624
  22. [22] Manuel López-Ibáñez, Joshua D. Knowles, and Marco Laumanns. 2011. On Sequential Online Archiving of Objective Vectors. In EMO, Vol. 6576. 46–60. doi:10.1007/978-3-642-19893-9_4
  23. [23] Md. Shahriar Mahbub, Tobias Wagner, and Luigi Crema. 2015. Improving Robustness of Stopping Multi-objective Evolutionary Algorithms by Simultaneously Monitoring Objective and Decision Space. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July 11-15, 2015, Sara Silva and Anna Isabel Esparcia-Alcázar (Ed…
  24. [24] Luis Martí, Jesús García, Antonio Berlanga, and José Manuel Molina. 2007. A cumulative evidential stopping criterion for multiobjective optimization evolutionary algorithms. In GECCO. 911. doi:10.1145/1276958.1277141
  25. [25] Luis Martí, Jesús García, Antonio Berlanga, and José M. Molina. 2016. A stopping criterion for multi-objective optimization evolutionary algorithms. Inf. Sci. 367-368 (2016), 700–718. doi:10.1016/J.INS.2016.07.025
  26. [26] Dhish Kumar Saxena, Arnab Sinha, Joao A. Duro, and Qingfu Zhang. 2015. Entropy-based termination criterion for multiobjective evolutionary algorithms. IEEE Trans. Evol. Comput. 20, 4 (2015), 485–498.
  27. [27] Tobias Wagner, Heike Trautmann, and Boris Naujoks. 2009. OCD: Online Convergence Detection for Evolutionary Multi-Objective Algorithms Based on Statistical Testing. In EMO. 198–215. doi:10.1007/978-3-642-01020-0_19
  28. [28] Qingfu Zhang and Hui Li. 2007. MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE Trans. Evol. Comput. 11, 6 (2007), 712–731. doi:10.1109/TEVC.2007.892759
  29. [29] Eckart Zitzler and Simon Künzli. 2004. Indicator-Based Selection in Multiobjective Search. In Parallel Problem Solving from Nature - PPSN VIII, 8th International Conference, Birmingham, UK, September 18-22, 2004, Proceedings (Lecture Notes in Computer Science, Vol. 3242), Xin Yao, Edmund K. Burke, José Antonio Lozano, Jim Smith, Juan Julián Merelo Guervós…
  30. [30] Eckart Zitzler and Lothar Thiele. 1998. Multiobjective Optimization Using Evolutionary Algorithms - A Comparative Case Study. In PPSN. 292–304. doi:10.1007/BFb0056872
  31. [31] Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M. Fonseca, and Viviane Grunert da Fonseca. 2003. Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7, 2 (2003), 117–132. doi:10.1109/TEVC.2003.810758