Recognition: 2 theorem links
A Comparative Study of Model Selection Criteria for Symbolic Regression
Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3
The pith
MDL selects symbolic regression models with the lowest test error and shortest length on most synthetic datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through systematic tests on synthetic data, the study finds that minimum description length consistently identifies models with the lowest test error and the shortest length across most datasets. While no criterion wins in every case, both MDL and the Bayesian information criterion show the highest probability of recovering the ground-truth expressions.
What carries the argument
Empirical ranking of candidate expressions by AIC, AICc, BIC, MDL, and Efron's bootstrap estimate of in-sample prediction error on noise-perturbed ground-truth functions, measuring selection accuracy and generalization.
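For concreteness, the criteria can be sketched as penalized fit scores. The snippet below uses the standard Gaussian-likelihood forms of AIC, AICc, and BIC; the MDL line is a simplified two-part codelength (data cost plus a per-node cost for expression length) standing in for the paper's exact formulation, which this summary does not spell out. The candidate statistics are hypothetical.

```python
import numpy as np

def criteria(rss, n, k, length):
    """Penalized scores for one candidate: rss = residual sum of
    squares on n training points, k = fitted parameters, length =
    expression size in nodes. Lower is better for all criteria."""
    fit = n * np.log(rss / n)                   # -2 log-likelihood up to a constant
    aic = fit + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = fit + k * np.log(n)
    mdl = 0.5 * fit + length * np.log(2)        # illustrative two-part codelength
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "MDL": mdl}

# Hypothetical candidates: (rss, n, k, expression length)
pool = {"short": (1.2, 100, 2, 5), "long": (1.0, 100, 6, 17)}
scores = {name: criteria(*stats) for name, stats in pool.items()}
for crit in ("AIC", "AICc", "BIC", "MDL"):
    best = min(scores, key=lambda m: scores[m][crit])
    print(f"{crit} prefers {best}")
```

Note how the criteria weight complexity differently: only the MDL line sees the expression length directly, while AIC/AICc/BIC penalize the fitted-parameter count.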
If this is right
- MDL offers a practical default for automatic selection when the goal is low test error.
- BIC serves as a strong alternative when the priority is recovering the true underlying expression.
- No universal winner exists, so dataset characteristics still influence the best choice.
- Controlled perturbation of ground-truth functions provides a reproducible benchmark for comparing selection methods.
Where Pith is reading between the lines
- If MDL performs well on synthetics, it could reduce manual tuning when symbolic regression is applied to noisy experimental data.
- Testing the criteria on actual genetic programming output rather than controlled perturbations might reveal different rankings.
- The same selection approach could be adapted to other expression-discovery tasks where complexity penalties matter.
Load-bearing premise
Results from ranking models created by perturbing known ground-truth expressions with Gaussian noise will predict performance on real data and on candidate sets actually produced by symbolic regression algorithms.
What would settle it
Apply the same criteria to candidate models generated by an actual symbolic regression run on a real dataset whose underlying expression is independently known, then check whether MDL still yields the lowest test error.
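The shape of that settling experiment, in miniature: fit a pool of candidate expressions against noisy samples of a known function, select with one criterion, and score the selection on held-out data. Everything below is hypothetical (the paper's actual functions, noise levels, and MDL formulation are not listed in this summary); BIC stands in as the selector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground truth and a Gaussian-noise training sample.
f_true = lambda x: 2.0 * x**2 + 0.5 * x
x_tr = rng.uniform(-2, 2, 50)
y_tr = f_true(x_tr) + rng.normal(0.0, 0.1, 50)
x_te = rng.uniform(-2, 2, 1000)          # held-out test inputs

# Candidate pool: (predictor, parameter count) for the true form,
# a coefficient-perturbed variant, and under/over-parameterized ones.
pool = {
    "true":      (lambda x: 2.0 * x**2 + 0.5 * x, 2),
    "perturbed": (lambda x: 2.1 * x**2 + 0.4 * x, 2),
    "underfit":  (lambda x: 2.3 * x**2, 1),
    "overfit":   (lambda x: 2.0 * x**2 + 0.5 * x + 0.001 * x**3, 3),
}

def bic(pred, k, x, y):
    rss = float(np.sum((y - pred(x)) ** 2))
    n = len(x)
    return n * np.log(rss / n) + k * np.log(n)

chosen = min(pool, key=lambda name: bic(*pool[name], x_tr, y_tr))
test_mse = float(np.mean((f_true(x_te) - pool[chosen][0](x_te)) ** 2))
print(f"BIC selects {chosen!r}; held-out MSE = {test_mse:.4f}")
```

The proposed test would replace the hand-built pool with candidates from an actual symbolic regression run and BIC with MDL.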
Figures
Original abstract
Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron's bootstrap estimate for the in-sample prediction error on seven synthetic datasets with Gaussian noise. We rank candidate expressions generated by perturbing ground-truth functions to assess generalization error and selection probability of the ground-truth expression. Our findings reveal that MDL consistently identifies models with the lowest test error and the shortest length across most datasets. While no single criterion dominates all results, MDL and BIC produced the highest probability of selecting the ground-truth expressions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a systematic empirical comparison of model selection criteria for symbolic regression, specifically AIC, AICc, BIC, MDL, and Efron's bootstrap estimate of in-sample prediction error. Using seven synthetic datasets with added Gaussian noise, candidate expressions are generated via controlled perturbations of known ground-truth functions. The criteria are ranked by their ability to identify models with low held-out test error, short length, and high probability of recovering the ground-truth expression. The central finding is that MDL consistently selects models with the lowest test error and shortest length across most datasets, while MDL and BIC achieve the highest ground-truth recovery rates.
Significance. If the results generalize, the work provides actionable empirical guidance for automatic model selection from Pareto fronts in symbolic regression, where reliable generalization remains an open problem. The use of ground-truth recovery as an evaluation metric and the multi-dataset protocol are strengths that allow direct assessment of selection quality. The study is purely empirical with no circular derivations, which supports its internal validity on the tested setups.
major comments (2)
- [Experimental Setup] Experimental setup (candidate generation): generating candidate expressions exclusively by small perturbations of the ground-truth functions produces an artificial distribution of in-sample errors and complexities that does not match the Pareto fronts or error-complexity correlations arising from actual GP-based symbolic regression search. This setup is load-bearing for the claim that MDL is superior, as the reported advantages in test error and ground-truth recovery may be artifacts of the controlled perturbation process rather than properties that hold for realistic candidate pools.
- [Results] Results and discussion: the assertion that MDL 'consistently identifies models with the lowest test error ... across most datasets' requires explicit reporting of per-dataset rankings, effect sizes, and whether the protocol (including noise levels, perturbation magnitudes, and statistical tests) was pre-specified; without these details the robustness of the cross-dataset claim cannot be verified.
minor comments (1)
- [Methods] The abstract and methods should explicitly list the seven ground-truth functions, the exact perturbation operators, and the range of noise variances used, to enable full reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below with our responses and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Experimental Setup] Experimental setup (candidate generation): generating candidate expressions exclusively by small perturbations of the ground-truth functions produces an artificial distribution of in-sample errors and complexities that does not match the Pareto fronts or error-complexity correlations arising from actual GP-based symbolic regression search. This setup is load-bearing for the claim that MDL is superior, as the reported advantages in test error and ground-truth recovery may be artifacts of the controlled perturbation process rather than properties that hold for realistic candidate pools.
Authors: We acknowledge that the perturbation-based generation of candidates creates a controlled but artificial distribution of errors and complexities that does not replicate the stochastic correlations found in Pareto fronts from standard GP symbolic regression. This design choice was made to isolate the behavior of the selection criteria when the ground truth is known and to enable direct measurement of recovery rates. We agree this limits the strength of any claim about performance on realistic candidate pools. In the revision we will add an explicit limitations subsection that describes the synthetic nature of the candidate set, discusses its implications for generalizability, and outlines planned follow-up experiments that apply the same criteria to full GP-generated fronts. revision: partial
Referee: [Results] Results and discussion: the assertion that MDL 'consistently identifies models with the lowest test error ... across most datasets' requires explicit reporting of per-dataset rankings, effect sizes, and whether the protocol (including noise levels, perturbation magnitudes, and statistical tests) was pre-specified; without these details the robustness of the cross-dataset claim cannot be verified.
Authors: We agree that greater granularity is required. The revised manuscript will include a supplementary table (or expanded main-text table) that reports, for each of the seven datasets separately, the ranking of all criteria by test error, model length, and ground-truth recovery probability, together with the raw performance values. We will also report effect sizes (e.g., mean differences in test error between MDL and the next-best criterion). The noise variances and perturbation magnitudes were selected from values commonly used in the symbolic regression literature to produce moderate signal-to-noise ratios; however, the overall experimental protocol was not pre-registered. We will state this explicitly in the methods section, list the exact parameter values employed, and add statistical comparisons (Friedman test with post-hoc Nemenyi tests) to support or qualify the word “consistently.” revision: yes
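The statistical comparison the authors promise can be sketched with SciPy's Friedman test over a datasets-by-criteria error matrix; the matrix below is random placeholder data, not the paper's results, and pairwise post-hoc Nemenyi comparisons would additionally need a package such as scikit-posthocs.

```python
import numpy as np
from scipy import stats

# Placeholder test-error matrix: 7 datasets (rows) x 5 criteria (cols).
rng = np.random.default_rng(0)
errors = rng.normal(loc=[0.10, 0.11, 0.11, 0.09, 0.12],
                    scale=0.01, size=(7, 5))
names = ["AIC", "AICc", "BIC", "MDL", "bootstrap"]

# Friedman test: do the criteria's per-dataset rankings differ overall?
stat, p = stats.friedmanchisquare(*errors.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Mean ranks (1 = lowest error); Nemenyi would compare these pairwise.
mean_ranks = stats.rankdata(errors, axis=1).mean(axis=0)
for name, r in zip(names, mean_ranks):
    print(f"{name}: mean rank {r:.2f}")
```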
Circularity Check
No circularity: pure empirical benchmark on synthetic perturbations
Full rationale
The manuscript is a controlled empirical comparison of AIC, AICc, BIC, MDL and bootstrap criteria. Candidate expressions are generated by explicit perturbation of known ground-truth functions, ranked by each criterion, and evaluated on held-out test error plus ground-truth recovery rate. No derivation, uniqueness theorem, ansatz, or fitted parameter is invoked; every reported ranking follows directly from the stated experimental protocol and external test data. The methodology is therefore self-contained and does not reduce any claim to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- (domain assumption) Gaussian noise model for synthetic data generation
- (domain assumption) Perturbed ground-truth expressions form a representative set of candidate models
Lean theorems connected to this paper
- washburn_uniqueness_aczel in IndisputableMonolith/Cost/FunctionalEquation.lean, tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear.
Quoted passage: "We rank candidate expressions generated by perturbing ground-truth functions... MDL consistently identifies models with the lowest test error and the shortest length"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.