pith. machine review for the scientific record.

arxiv: 2605.11233 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links


A Comparative Study of Model Selection Criteria for Symbolic Regression

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords symbolic regression · model selection · minimum description length · information criteria · AIC · BIC · generalization · genetic programming

The pith

MDL selects symbolic regression models with lowest test error and shortest length on most synthetic datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares common model selection criteria to help symbolic regression pick mathematical expressions that generalize well without excessive complexity. It tests AIC, corrected AIC, BIC, minimum description length, and a bootstrap error estimate on seven synthetic datasets made by adding Gaussian noise to known ground-truth functions. The comparison ranks perturbed candidate expressions to measure both generalization error on held-out data and the chance that the true expression is chosen. A sympathetic reader cares because modern genetic programming systems often return many Pareto-optimal candidates, and a reliable automatic selector would make symbolic regression more practical for real problems.

Core claim

Through systematic tests on synthetic data, the study finds that minimum description length consistently identifies models with the lowest test error and the shortest length across most datasets. While no criterion wins in every case, both MDL and the Bayesian information criterion show the highest probability of recovering the ground-truth expressions.

What carries the argument

Empirical ranking of candidate expressions by AIC, AICc, BIC, MDL, and bootstrap in-sample error on noise-perturbed ground-truth functions to measure selection accuracy and generalization.
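For concreteness, the three analytic criteria can be sketched in their standard Gaussian-likelihood forms. This is a minimal illustration with made-up candidate expressions and residuals, not the paper's implementation:

```python
import numpy as np

def information_criteria(rss, n, k):
    """Standard Gaussian-likelihood forms; rss = residual sum of squares,
    n = sample size, k = number of fitted parameters."""
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bic = n * np.log(rss / n) + k * np.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}

# Three hypothetical candidate expressions fitted to n = 50 points
# (residual sum of squares, parameter count) -- illustrative numbers only.
candidates = {
    "c1*x^2": (4.1, 1),
    "c1*x^2 + c2*x + c3": (3.9, 3),
    "degree-5 polynomial": (3.7, 6),
}
n = 50
for name, (rss, k) in candidates.items():
    scores = information_criteria(rss, n, k)
    print(name, {c: round(v, 2) for c, v in scores.items()})
# Each criterion trades goodness of fit against a complexity penalty;
# BIC's log(n) term punishes the 6-parameter model hardest.
```

Ranking candidates by any of these scores (lower is better) is the selection step the study evaluates; MDL adds a description-length penalty for the expression's structure on top of its fit.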

If this is right

  • MDL offers a practical default for automatic selection when the goal is low test error.
  • BIC serves as a strong alternative when the priority is recovering the true underlying expression.
  • No universal winner exists, so dataset characteristics still influence the best choice.
  • Controlled perturbation of ground-truth functions provides a reproducible benchmark for comparing selection methods.
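Alongside the analytic criteria, the comparison includes Efron's bootstrap estimate of in-sample prediction error, which corrects training error for optimism. A rough sketch of one such optimism correction, using illustrative polynomial fits rather than the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 60)
y = x**2 + rng.normal(0, 0.2, 60)   # hypothetical dataset, true model is quadratic

def fit_predict(xtr, ytr, xev, deg):
    return np.polyval(np.polyfit(xtr, ytr, deg), xev)

def bootstrap_in_sample_error(x, y, deg, B=200):
    """Training error plus a bootstrap estimate of its optimism."""
    train_err = np.mean((fit_predict(x, y, x, deg) - y) ** 2)
    optimism = 0.0
    for _ in range(B):
        idx = rng.integers(0, len(x), len(x))        # resample with replacement
        yhat = fit_predict(x[idx], y[idx], x, deg)   # fit on resample, eval on original
        boot_train = np.mean((fit_predict(x[idx], y[idx], x[idx], deg) - y[idx]) ** 2)
        optimism += np.mean((yhat - y) ** 2) - boot_train
    return train_err + optimism / B

for deg in (1, 2, 8):
    print(deg, round(bootstrap_in_sample_error(x, y, deg), 4))
# The degree-8 fit has the lowest raw training error but the largest optimism
# correction, so the corrected estimate tends to favor the simpler quadratic.
```

The design choice is the same as in the analytic criteria: an honest estimate of out-of-sample error replaces raw training error as the selection signal.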

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If MDL performs well on synthetics, it could reduce manual tuning when symbolic regression is applied to noisy experimental data.
  • Testing the criteria on actual genetic programming output rather than controlled perturbations might reveal different rankings.
  • The same selection approach could be adapted to other expression-discovery tasks where complexity penalties matter.

Load-bearing premise

Results from ranking models created by perturbing known ground-truth expressions with Gaussian noise will predict performance on real data and on candidate sets actually produced by symbolic regression algorithms.
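That premise can be made concrete with a toy version of the protocol. The ground-truth function and perturbation below are hypothetical; the paper's actual seven functions and perturbation operators are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth and one structural perturbation of it.
def truth(x):
    return 2.0 * x**2 + x

def perturbed(x):
    return 2.0 * x**2 + x + 0.5 * np.sin(x)  # spurious extra term

x_train = rng.uniform(-2, 2, 100)
y_train = truth(x_train) + rng.normal(0, 0.1, 100)  # Gaussian observation noise
x_test = rng.uniform(-2, 2, 1000)
y_test = truth(x_test)                               # noiseless held-out target

test_mse = {name: float(np.mean((f(x_test) - y_test) ** 2))
            for name, f in [("truth", truth), ("perturbed", perturbed)]}
print(test_mse)
# The ground truth attains zero noiseless test error by construction, so
# "selection probability" measures how often a criterion ranks it first.
```

The load-bearing question is whether candidate pools built this way resemble the error-complexity trade-offs of real GP Pareto fronts, where no candidate is the exact truth.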

What would settle it

Apply the same criteria to candidate models generated by an actual symbolic regression run on a real dataset whose underlying expression is independently known, then check whether MDL still yields the lowest test error.

Figures

Figures reproduced from arXiv: 2605.11233 by Alessandro Lucantonio, Ali Soltani, Fabricio Olivetti de Franca, Gabriel Kronberger, Mattia Billa.

Figure 2: Average expression size for different model selection criteria.
Figure 3: Precision at k for different model selection criteria.
Original abstract

Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron's bootstrap estimate for the in-sample prediction error on seven synthetic datasets with Gaussian noise. We rank candidate expressions generated by perturbing ground-truth functions to assess generalization error and selection probability of the ground-truth expression. Our findings reveal that MDL consistently identifies models with the lowest test error and the shortest length across most datasets. While no single criterion dominates all results, MDL and BIC produced the highest probability of selecting the ground-truth expressions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript conducts a systematic empirical comparison of model selection criteria for symbolic regression, specifically AIC, AICc, BIC, MDL, and Efron's bootstrap estimate of in-sample prediction error. Using seven synthetic datasets with added Gaussian noise, candidate expressions are generated via controlled perturbations of known ground-truth functions. The criteria are ranked by their ability to identify models with low held-out test error, short length, and high probability of recovering the ground-truth expression. The central finding is that MDL consistently selects models with the lowest test error and shortest length across most datasets, while MDL and BIC achieve the highest ground-truth recovery rates.

Significance. If the results generalize, the work provides actionable empirical guidance for automatic model selection from Pareto fronts in symbolic regression, where reliable generalization remains an open problem. The use of ground-truth recovery as an evaluation metric and the multi-dataset protocol are strengths that allow direct assessment of selection quality. The study is purely empirical with no circular derivations, which supports its internal validity on the tested setups.

major comments (2)
  1. [Experimental Setup] Experimental setup (candidate generation): generating candidate expressions exclusively by small perturbations of the ground-truth functions produces an artificial distribution of in-sample errors and complexities that does not match the Pareto fronts or error-complexity correlations arising from actual GP-based symbolic regression search. This setup is load-bearing for the claim that MDL is superior, as the reported advantages in test error and ground-truth recovery may be artifacts of the controlled perturbation process rather than properties that hold for realistic candidate pools.
  2. [Results] Results and discussion: the assertion that MDL 'consistently identifies models with the lowest test error ... across most datasets' requires explicit reporting of per-dataset rankings, effect sizes, and whether the protocol (including noise levels, perturbation magnitudes, and statistical tests) was pre-specified; without these details the robustness of the cross-dataset claim cannot be verified.
minor comments (1)
  1. [Methods] The abstract and methods should explicitly list the seven ground-truth functions, the exact perturbation operators, and the range of noise variances used, to enable full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below with our responses and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Experimental Setup] Experimental setup (candidate generation): generating candidate expressions exclusively by small perturbations of the ground-truth functions produces an artificial distribution of in-sample errors and complexities that does not match the Pareto fronts or error-complexity correlations arising from actual GP-based symbolic regression search. This setup is load-bearing for the claim that MDL is superior, as the reported advantages in test error and ground-truth recovery may be artifacts of the controlled perturbation process rather than properties that hold for realistic candidate pools.

    Authors: We acknowledge that the perturbation-based generation of candidates creates a controlled but artificial distribution of errors and complexities that does not replicate the stochastic correlations found in Pareto fronts from standard GP symbolic regression. This design choice was made to isolate the behavior of the selection criteria when the ground truth is known and to enable direct measurement of recovery rates. We agree this limits the strength of any claim about performance on realistic candidate pools. In the revision we will add an explicit limitations subsection that describes the synthetic nature of the candidate set, discusses its implications for generalizability, and outlines planned follow-up experiments that apply the same criteria to full GP-generated fronts. revision: partial

  2. Referee: [Results] Results and discussion: the assertion that MDL 'consistently identifies models with the lowest test error ... across most datasets' requires explicit reporting of per-dataset rankings, effect sizes, and whether the protocol (including noise levels, perturbation magnitudes, and statistical tests) was pre-specified; without these details the robustness of the cross-dataset claim cannot be verified.

    Authors: We agree that greater granularity is required. The revised manuscript will include a supplementary table (or expanded main-text table) that reports, for each of the seven datasets separately, the ranking of all criteria by test error, model length, and ground-truth recovery probability, together with the raw performance values. We will also report effect sizes (e.g., mean differences in test error between MDL and the next-best criterion). The noise variances and perturbation magnitudes were selected from values commonly used in the symbolic regression literature to produce moderate signal-to-noise ratios; however, the overall experimental protocol was not pre-registered. We will state this explicitly in the methods section, list the exact parameter values employed, and add statistical comparisons (Friedman test with post-hoc Nemenyi tests) to support or qualify the word “consistently.” revision: yes
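The statistical comparison the authors propose can be sketched as follows, with hypothetical per-dataset test errors standing in for the paper's real values:

```python
from scipy.stats import friedmanchisquare

# Hypothetical test errors on the seven datasets for three criteria;
# the real values would come from the promised per-dataset tables.
mdl = [0.10, 0.12, 0.08, 0.11, 0.09, 0.15, 0.10]
bic = [0.11, 0.13, 0.09, 0.12, 0.10, 0.14, 0.12]
aic = [0.14, 0.16, 0.12, 0.15, 0.13, 0.18, 0.15]

# Friedman test: ranks the criteria within each dataset, then asks whether
# the average ranks differ more than chance would allow.
stat, p = friedmanchisquare(mdl, bic, aic)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")
# A small p justifies a post-hoc Nemenyi test (e.g. posthoc_nemenyi_friedman
# in the third-party scikit-posthocs package) to see which pairs differ.
```

Because the Friedman test operates on within-dataset ranks, it directly tests the cross-dataset "consistently" claim without assuming comparable error scales across datasets.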

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark on synthetic perturbations

full rationale

The manuscript is a controlled empirical comparison of AIC, AICc, BIC, MDL and bootstrap criteria. Candidate expressions are generated by explicit perturbation of known ground-truth functions, ranked by each criterion, and evaluated on held-out test error plus ground-truth recovery rate. No derivation, uniqueness theorem, ansatz, or fitted parameter is invoked; every reported ranking follows directly from the stated experimental protocol and external test data. The methodology is therefore self-contained and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard statistical assumptions for information criteria and on the representativeness of synthetic data; no new free parameters, axioms, or entities are introduced beyond those already present in the cited criteria.

axioms (2)
  • domain assumption Gaussian noise model for synthetic data generation
    Invoked to create the seven benchmark datasets used for all comparisons.
  • domain assumption Perturbed ground-truth expressions form a representative set of candidate models
    Used to compute selection probability of the true expression.

pith-pipeline@v0.9.0 · 5505 in / 1346 out tokens · 64431 ms · 2026-05-13T02:20:49.998238+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1] Alexandros Agapitos, Anthony Brabazon, and Michael O’Neill. 2012. Controlling Overfitting in Symbolic Regression Based on a Bias/Variance Error Decomposition. Springer Berlin Heidelberg, 438–447. doi:10.1007/978-3-642-32937-1_44
  2. [2] Nur Merve Amil, Nicolas Bredeche, Christian Gagné, Sylvain Gelly, Marc Schoenauer, and Olivier Teytaud. 2009. A Statistical Learning Perspective of Genetic Programming. Springer Berlin Heidelberg, 327–338. doi:10.1007/978-3-642-01181-8_28
  3. [3] Sylvain Arlot and Alain Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4 (2010), 40–79. doi:10.1214/09-SS054
  4. [4] Deaglan Bartlett, Harry Desmond, and Pedro Ferreira. 2023. Priors for symbolic regression. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation (Lisbon, Portugal) (GECCO ’23 Companion). Association for Computing Machinery, New York, NY, USA, 2402–2411. doi:10.1145/3583133.3596327
  5. [5] Deaglan J Bartlett, Harry Desmond, and Pedro G Ferreira. 2023. Exhaustive symbolic regression. IEEE Transactions on Evolutionary Computation 28, 4 (2023), 950–964
  6. [6] Deaglan J. Bartlett, Harry Desmond, and Pedro G. Ferreira. 2023. Exhaustive Symbolic Regression. IEEE Transactions on Evolutionary Computation (2023), 1–1. arXiv:2211.11461 [astro-ph.CO] doi:10.1109/TEVC.2023.3280250
  7. [7] Geoffrey F. Bomarito, Patrik E. Leser, N. C. M. Strauss, K. M. Garbrecht, and J. D. Hochhalter. 2022. Bayesian model selection for reducing bloat and overfitting in genetic programming for symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (Boston, Massachusetts) (GECCO ’22). Association for Computing Machin...
  8. [8] Jure Brence, Ljupčo Todorovski, and Sašo Džeroski. 2021. Probabilistic grammars for equation discovery. Knowledge-Based Systems 224 (2021), 107077
  9. [9] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113, 15 (2016), 3932–3937
  10. [10] Kenneth P. Burnham and David R. Anderson (Eds.). 2004. Model Selection and Multimodel Inference. Springer New York. doi:10.1007/b97636
  11. [11] Qi Chen, Bing Xue, Lin Shang, and Mengjie Zhang. 2016. Improving Generalisation of Genetic Programming for Symbolic Regression with Structural Risk Minimisation. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16). ACM, 709–716. doi:10.1145/2908812.2908842
  12. [12] Bradley Efron. 2004. The Estimation of Prediction Error. J. Amer. Statist. Assoc. 99, 467 (2004), 619–632. doi:10.1198/016214504000000692
  13. [13] Jerome H. Friedman, Eric Grosse, and Werner Stuetzle. 1983. Multidimensional Additive Spline Approximation. SIAM J. Sci. Statist. Comput. 4, 2 (June 1983), 291–301. doi:10.1137/0904023
  14. [14] Peter D. Grünwald. 2007. The Minimum Description Length Principle. The MIT Press. doi:10.7551/mitpress/4643.001.0001
  15. [15] Trevor Hastie, Robert Tibshirani, Jerome Friedman, et al. 2009. The Elements of Statistical Learning. Springer Series in Statistics, New York
  16. [16] Pierre-Alexandre Kamienny, Guillaume Lample, Sylvain Lamprier, and Marco Virgolin. 2023. Deep generative symbolic regression with Monte-Carlo-tree-search. In International Conference on Machine Learning. PMLR, 15655–15668
  17. [17] Lukas Kammerer, Gabriel Kronberger, Bogdan Burlacu, Stephan M Winkler, Michael Kommenda, and Michael Affenzeller. 2020. Symbolic regression by exhaustive search: Reducing the search space using syntactical constraints and efficient semantic structure deduplication. In Genetic Programming Theory and Practice XVII. Springer, 79–99
  18. [18] John R. Koza. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press
  19. [19] Gabriel Kronberger, Bogdan Burlacu, Michael Kommenda, Stephan M. Winkler, and Michael Affenzeller. 2024. Symbolic Regression. Chapman and Hall/CRC. doi:10.1201/9781315166407
  20. [20]–[21] Mikel Landajuela, Chak Shing Lee, Jiachen Yang, Ruben Glatt, Claudio P Santiago, Ignacio Aravena, Terrell Mundhenk, Garrett Mulcahy, and Brenden K Petersen. 2022. A unified framework for deep symbolic regression. Advances in Neural Information Processing Systems 35 (2022), 33985–33998
  22. [22] William B. Langdon and Riccardo Poli. 2002. Foundations of Genetic Programming. Springer-Verlag. doi:10.1007/978-3-662-04726-2
  23. [23] Aron I. Maciel, Ivan G. Costa, and Ana C. Lorena. 2016. Measuring the complexity of regression problems. Proceedings of the International Joint Conference on Neural Networks 2016-October (2016), 1450–1457. doi:10.1109/IJCNN.2016.7727369
  24. [24] Georg Martius and Christoph H Lampert. 2016. Extrapolation and learning equations. arXiv preprint arXiv:1610.02995 (2016)
  25. [25] Trent McConaghy. 2011. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX. Springer, 235–260
  26. [26] José L. Montaña, César L. Alonso, Cruz Enrique Borges, and Javier de la Dehesa. 2011. Penalty Functions for Genetic Programming Algorithms. Springer Berlin Heidelberg, 550–562. doi:10.1007/978-3-642-21928-3_40
  27. [27] Andrea Murari, Riccardo Rossi, Luca Spolladore, Michele Lungaroni, Pasquale Gaudio, and Michela Gelfusa. 2023. A practical utility-based but objective approach to model selection for regression in scientific applications. Artificial Intelligence Review 56, S2 (Oct. 2023), 2825–2859. doi:10.1007/s10462-023-10591-4
  28. [28] Miguel Nicolau and Alexandros Agapitos. 2021. Choosing function sets with better generalisation performance for symbolic regression models. Genetic Programming and Evolvable Machines 22, 1 (2021), 73–100
  29. [29] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, 1 (11 Dec 2017), 36. doi:10.1186/s13040-017-0154-4
  30. [30] Michael O’Neill and Conor Ryan. 2004. Grammatical Evolution by Grammatical Evolution: The Evolution of Grammar and Genetic Code. Springer Berlin Heidelberg, 138–149. doi:10.1007/978-3-540-24650-3_13
  31. [31] Fitria Wulandari Ramlan, Gabriel Kronberger, Colm O’Riordan, and James McDermott. 2026. Comparative Analysis of Model Selection Criteria for Symbolic Regression using Genetic Programming. In Computational Intelligence, Proc. of IJCCI 2025, CCIS 2828. Springer Nature, 1–18. doi:10.1007/978-3-032-15635-8_6 (accepted for publication)
  32. [32] Jorma Rissanen. 1978. Modeling by shortest data description. Automatica 14, 5 (1978), 465–471. doi:10.1016/0005-1098(78)90005-5
  33. [33] Subham Sahoo, Christoph Lampert, and Georg Martius. 2018. Learning Equations for Extrapolation and Control. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4442–4450. https://proceedings.mlr.press/v80/sahoo18a.html
  34. [34] Rafal Salustowicz and Jürgen Schmidhuber. 1997. Probabilistic Incremental Program Evolution. Evolutionary Computation 5, 2 (June 1997), 123–141. doi:10.1162/evco.1997.5.2.123
  35. [35] Gideon Schwarz. 1978. Estimating the Dimension of a Model. The Annals of Statistics 6, 2 (1978), 461–464. http://www.jstor.org/stable/2958889
  36. [36] Erik-Jan Senn. 2025. Model Recovery in Symbolic Regression: Theory, Conjectures, and Open Questions. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’25 Companion). ACM, 2556–2562. doi:10.1145/3712255.3734334
  37. [37]–[38] Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, and Chandan Reddy. 2023. Transformer-based planning for symbolic regression. Advances in Neural Information Processing Systems 36 (2023), 45907–45919
  39. [39] Guido F. Smits and Mark Kotanchek. 2005. Pareto-Front Exploitation in Symbolic Regression. Springer US, Boston, MA, 283–299. doi:10.1007/0-387-23254-0_17
  40. [40] Fangzheng Sun, Yang Liu, Jian-Xun Wang, and Hao Sun. 2023. Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ZTK3SefE8_Z
  41. [41] Alexander Topchy and W. F. Punch. 2001. Faster genetic programming based on local gradient search of numeric leaf values. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation (San Francisco, California) (GECCO ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 155–162
  42. [42] Ekaterina J. Vladislavleva, Guido F. Smits, and Dick den Hertog. 2009. Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming. IEEE Transactions on Evolutionary Computation 13, 2 (April 2009), 333–349. doi:10.1109/tevc.2008.926486
  43. [43] Yilong Xu, Yang Liu, and Hao Sun. 2024. Reinforcement symbolic regression machine. In The Twelfth International Conference on Learning Representations