pith. machine review for the scientific record.

arxiv: 2605.11233 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links


A Comparative Study of Model Selection Criteria for Symbolic Regression

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords symbolic regression · model selection · minimum description length · information criteria · AIC · BIC · generalization · genetic programming

The pith

MDL selects symbolic regression models with lowest test error and shortest length on most synthetic datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares common model selection criteria to help symbolic regression pick mathematical expressions that generalize well without excessive complexity. It tests AIC, corrected AIC, BIC, minimum description length, and a bootstrap error estimate on seven synthetic datasets made by adding Gaussian noise to known ground-truth functions. The comparison ranks perturbed candidate expressions to measure both generalization error on held-out data and the chance that the true expression is chosen. A sympathetic reader cares because modern genetic programming systems often return many Pareto-optimal candidates, and a reliable automatic selector would make symbolic regression more practical for real problems.

Core claim

Through systematic tests on synthetic data, the study finds that minimum description length consistently identifies models with the lowest test error and the shortest length across most datasets. While no criterion wins in every case, both MDL and the Bayesian information criterion show the highest probability of recovering the ground-truth expressions.

What carries the argument

Empirical ranking of candidate expressions by AIC, AICc, BIC, MDL, and bootstrap in-sample error on noise-perturbed ground-truth functions to measure selection accuracy and generalization.
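For concreteness, the three analytic criteria can be sketched in their standard Gaussian-likelihood forms. This is a minimal illustration with made-up candidate expressions and residuals, not the paper's implementation:

```python
import numpy as np

def information_criteria(rss, n, k):
    """Standard Gaussian-likelihood forms; rss = residual sum of squares,
    n = sample size, k = number of fitted parameters."""
    aic = n * np.log(rss / n) + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bic = n * np.log(rss / n) + k * np.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}

# Three hypothetical candidate expressions fitted to n = 50 points
# (residual sum of squares, parameter count) -- illustrative numbers only.
candidates = {
    "c1*x^2": (4.1, 1),
    "c1*x^2 + c2*x + c3": (3.9, 3),
    "degree-5 polynomial": (3.7, 6),
}
n = 50
for name, (rss, k) in candidates.items():
    scores = information_criteria(rss, n, k)
    print(name, {c: round(v, 2) for c, v in scores.items()})
# Each criterion trades goodness of fit against a complexity penalty;
# BIC's log(n) term punishes the 6-parameter model hardest.
```

Ranking candidates by any of these scores (lower is better) is the selection step the study evaluates; MDL adds a description-length penalty for the expression's structure on top of its fit.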

If this is right

  • MDL offers a practical default for automatic selection when the goal is low test error.
  • BIC serves as a strong alternative when the priority is recovering the true underlying expression.
  • No universal winner exists, so dataset characteristics still influence the best choice.
  • Controlled perturbation of ground-truth functions provides a reproducible benchmark for comparing selection methods.
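Alongside the analytic criteria, the comparison includes Efron's bootstrap estimate of in-sample prediction error, which corrects training error for optimism. A rough sketch of one such optimism correction, using illustrative polynomial fits rather than the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 60)
y = x**2 + rng.normal(0, 0.2, 60)   # hypothetical dataset, true model is quadratic

def fit_predict(xtr, ytr, xev, deg):
    return np.polyval(np.polyfit(xtr, ytr, deg), xev)

def bootstrap_in_sample_error(x, y, deg, B=200):
    """Training error plus a bootstrap estimate of its optimism."""
    train_err = np.mean((fit_predict(x, y, x, deg) - y) ** 2)
    optimism = 0.0
    for _ in range(B):
        idx = rng.integers(0, len(x), len(x))        # resample with replacement
        yhat = fit_predict(x[idx], y[idx], x, deg)   # fit on resample, eval on original
        boot_train = np.mean((fit_predict(x[idx], y[idx], x[idx], deg) - y[idx]) ** 2)
        optimism += np.mean((yhat - y) ** 2) - boot_train
    return train_err + optimism / B

for deg in (1, 2, 8):
    print(deg, round(bootstrap_in_sample_error(x, y, deg), 4))
# The degree-8 fit has the lowest raw training error but the largest optimism
# correction, so the corrected estimate tends to favor the simpler quadratic.
```

The design choice is the same as in the analytic criteria: an honest estimate of out-of-sample error replaces raw training error as the selection signal.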

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If MDL performs well on synthetics, it could reduce manual tuning when symbolic regression is applied to noisy experimental data.
  • Testing the criteria on actual genetic programming output rather than controlled perturbations might reveal different rankings.
  • The same selection approach could be adapted to other expression-discovery tasks where complexity penalties matter.

Load-bearing premise

Results from ranking models created by perturbing known ground-truth expressions with Gaussian noise will predict performance on real data and on candidate sets actually produced by symbolic regression algorithms.
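That premise can be made concrete with a toy version of the protocol. The ground-truth function and perturbation below are hypothetical; the paper's actual seven functions and perturbation operators are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth and one structural perturbation of it.
def truth(x):
    return 2.0 * x**2 + x

def perturbed(x):
    return 2.0 * x**2 + x + 0.5 * np.sin(x)  # spurious extra term

x_train = rng.uniform(-2, 2, 100)
y_train = truth(x_train) + rng.normal(0, 0.1, 100)  # Gaussian observation noise
x_test = rng.uniform(-2, 2, 1000)
y_test = truth(x_test)                               # noiseless held-out target

test_mse = {name: float(np.mean((f(x_test) - y_test) ** 2))
            for name, f in [("truth", truth), ("perturbed", perturbed)]}
print(test_mse)
# The ground truth attains zero noiseless test error by construction, so
# "selection probability" measures how often a criterion ranks it first.
```

The load-bearing question is whether candidate pools built this way resemble the error-complexity trade-offs of real GP Pareto fronts, where no candidate is the exact truth.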

What would settle it

Apply the same criteria to candidate models generated by an actual symbolic regression run on a real dataset whose underlying expression is independently known, then check whether MDL still yields the lowest test error.

Figures

Figures reproduced from arXiv: 2605.11233 by Alessandro Lucantonio, Ali Soltani, Fabricio Olivetti de Franca, Gabriel Kronberger, Mattia Billa.

Figure 2: Average expression size for different model selection criteria.
Figure 3: Precision at k for different model selection criteria.
Original abstract

Effective model selection is critical in symbolic regression (SR) to identify mathematical expressions that balance accuracy and complexity, and have low expected error on unseen data. Many modern implementations of genetic programming (GP) for SR generate a set of Pareto optimal candidate solutions, but reliable automatic selection of solutions that generalize well remains an open issue. Current literature offers various information-theoretic and Bayesian approaches, yet comprehensive comparisons of their performance across different data regimes are limited. This study presents a systematic empirical comparison of widely used selection criteria: the Akaike information criterion (AIC), the corrected AIC (AICc), the Bayesian information criterion (BIC), minimum description length (MDL), as well as Efron's bootstrap estimate for the in-sample prediction error on seven synthetic datasets with Gaussian noise. We rank candidate expressions generated by perturbing ground-truth functions to assess generalization error and selection probability of the ground-truth expression. Our findings reveal that MDL consistently identifies models with the lowest test error and the shortest length across most datasets. While no single criterion dominates all results, MDL and BIC produced the highest probability of selecting the ground-truth expressions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript conducts a systematic empirical comparison of model selection criteria for symbolic regression, specifically AIC, AICc, BIC, MDL, and Efron's bootstrap estimate of in-sample prediction error. Using seven synthetic datasets with added Gaussian noise, candidate expressions are generated via controlled perturbations of known ground-truth functions. The criteria are ranked by their ability to identify models with low held-out test error, short length, and high probability of recovering the ground-truth expression. The central finding is that MDL consistently selects models with the lowest test error and shortest length across most datasets, while MDL and BIC achieve the highest ground-truth recovery rates.

Significance. If the results generalize, the work provides actionable empirical guidance for automatic model selection from Pareto fronts in symbolic regression, where reliable generalization remains an open problem. The use of ground-truth recovery as an evaluation metric and the multi-dataset protocol are strengths that allow direct assessment of selection quality. The study is purely empirical with no circular derivations, which supports its internal validity on the tested setups.

major comments (2)
  1. [Experimental Setup] Experimental setup (candidate generation): generating candidate expressions exclusively by small perturbations of the ground-truth functions produces an artificial distribution of in-sample errors and complexities that does not match the Pareto fronts or error-complexity correlations arising from actual GP-based symbolic regression search. This setup is load-bearing for the claim that MDL is superior, as the reported advantages in test error and ground-truth recovery may be artifacts of the controlled perturbation process rather than properties that hold for realistic candidate pools.
  2. [Results] Results and discussion: the assertion that MDL 'consistently identifies models with the lowest test error ... across most datasets' requires explicit reporting of per-dataset rankings, effect sizes, and whether the protocol (including noise levels, perturbation magnitudes, and statistical tests) was pre-specified; without these details the robustness of the cross-dataset claim cannot be verified.
minor comments (1)
  1. [Methods] The abstract and methods should explicitly list the seven ground-truth functions, the exact perturbation operators, and the range of noise variances used, to enable full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below with our responses and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Experimental Setup] Experimental setup (candidate generation): generating candidate expressions exclusively by small perturbations of the ground-truth functions produces an artificial distribution of in-sample errors and complexities that does not match the Pareto fronts or error-complexity correlations arising from actual GP-based symbolic regression search. This setup is load-bearing for the claim that MDL is superior, as the reported advantages in test error and ground-truth recovery may be artifacts of the controlled perturbation process rather than properties that hold for realistic candidate pools.

    Authors: We acknowledge that the perturbation-based generation of candidates creates a controlled but artificial distribution of errors and complexities that does not replicate the stochastic correlations found in Pareto fronts from standard GP symbolic regression. This design choice was made to isolate the behavior of the selection criteria when the ground truth is known and to enable direct measurement of recovery rates. We agree this limits the strength of any claim about performance on realistic candidate pools. In the revision we will add an explicit limitations subsection that describes the synthetic nature of the candidate set, discusses its implications for generalizability, and outlines planned follow-up experiments that apply the same criteria to full GP-generated fronts. revision: partial

  2. Referee: [Results] Results and discussion: the assertion that MDL 'consistently identifies models with the lowest test error ... across most datasets' requires explicit reporting of per-dataset rankings, effect sizes, and whether the protocol (including noise levels, perturbation magnitudes, and statistical tests) was pre-specified; without these details the robustness of the cross-dataset claim cannot be verified.

    Authors: We agree that greater granularity is required. The revised manuscript will include a supplementary table (or expanded main-text table) that reports, for each of the seven datasets separately, the ranking of all criteria by test error, model length, and ground-truth recovery probability, together with the raw performance values. We will also report effect sizes (e.g., mean differences in test error between MDL and the next-best criterion). The noise variances and perturbation magnitudes were selected from values commonly used in the symbolic regression literature to produce moderate signal-to-noise ratios; however, the overall experimental protocol was not pre-registered. We will state this explicitly in the methods section, list the exact parameter values employed, and add statistical comparisons (Friedman test with post-hoc Nemenyi tests) to support or qualify the word “consistently.” revision: yes
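The statistical comparison the authors propose can be sketched as follows, with hypothetical per-dataset test errors standing in for the paper's real values:

```python
from scipy.stats import friedmanchisquare

# Hypothetical test errors on the seven datasets for three criteria;
# the real values would come from the promised per-dataset tables.
mdl = [0.10, 0.12, 0.08, 0.11, 0.09, 0.15, 0.10]
bic = [0.11, 0.13, 0.09, 0.12, 0.10, 0.14, 0.12]
aic = [0.14, 0.16, 0.12, 0.15, 0.13, 0.18, 0.15]

# Friedman test: ranks the criteria within each dataset, then asks whether
# the average ranks differ more than chance would allow.
stat, p = friedmanchisquare(mdl, bic, aic)
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")
# A small p justifies a post-hoc Nemenyi test (e.g. posthoc_nemenyi_friedman
# in the third-party scikit-posthocs package) to see which pairs differ.
```

Because the Friedman test operates on within-dataset ranks, it directly tests the cross-dataset "consistently" claim without assuming comparable error scales across datasets.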

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark on synthetic perturbations

full rationale

The manuscript is a controlled empirical comparison of AIC, AICc, BIC, MDL and bootstrap criteria. Candidate expressions are generated by explicit perturbation of known ground-truth functions, ranked by each criterion, and evaluated on held-out test error plus ground-truth recovery rate. No derivation, uniqueness theorem, ansatz, or fitted parameter is invoked; every reported ranking follows directly from the stated experimental protocol and external test data. The methodology is therefore self-contained and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard statistical assumptions for information criteria and on the representativeness of synthetic data; no new free parameters, axioms, or entities are introduced beyond those already present in the cited criteria.

axioms (2)
  • domain assumption Gaussian noise model for synthetic data generation
    Invoked to create the seven benchmark datasets used for all comparisons.
  • domain assumption Perturbed ground-truth expressions form a representative set of candidate models
    Used to compute selection probability of the true expression.

pith-pipeline@v0.9.0 · 5505 in / 1346 out tokens · 64431 ms · 2026-05-13T02:20:49.998238+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1] Alexandros Agapitos, Anthony Brabazon, and Michael O’Neill. 2012. Controlling Overfitting in Symbolic Regression Based on a Bias/Variance Error Decomposition. Springer Berlin Heidelberg, 438–447. doi:10.1007/978-3-642-32937-1_44
  2. [2] Nur Merve Amil, Nicolas Bredeche, Christian Gagné, Sylvain Gelly, Marc Schoenauer, and Olivier Teytaud. 2009. A Statistical Learning Perspective of Genetic Programming. Springer Berlin Heidelberg, 327–338. doi:10.1007/978-3-642-01181-8_28
  3. [3] Sylvain Arlot and Alain Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4 (2010), 40–79. doi:10.1214/09-SS054
  4. [4] Deaglan Bartlett, Harry Desmond, and Pedro Ferreira. 2023. Priors for symbolic regression. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation (Lisbon, Portugal) (GECCO ’23 Companion). Association for Computing Machinery, New York, NY, USA, 2402–2411. doi:10.1145/3583133.3596327
  5. [5] Deaglan J Bartlett, Harry Desmond, and Pedro G Ferreira. 2023. Exhaustive symbolic regression. IEEE Transactions on Evolutionary Computation 28, 4 (2023), 950–964
  6. [6] Deaglan J. Bartlett, Harry Desmond, and Pedro G. Ferreira. 2023. Exhaustive Symbolic Regression. IEEE Transactions on Evolutionary Computation (2023), 1–1. arXiv:2211.11461 [astro-ph.CO] doi:10.1109/TEVC.2023.3280250
  7. [7] Geoffrey F. Bomarito, Patrik E. Leser, N. C. M. Strauss, K. M. Garbrecht, and J. D. Hochhalter. 2022. Bayesian model selection for reducing bloat and overfitting in genetic programming for symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (Boston, Massachusetts) (GECCO ’22). Association for Computing Machin...
  8. [8] Jure Brence, Ljupčo Todorovski, and Sašo Džeroski. 2021. Probabilistic grammars for equation discovery. Knowledge-Based Systems 224 (2021), 107077
  9. [9] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. 2016. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113, 15 (2016), 3932–3937
  10. [10] Kenneth P. Burnham and David R. Anderson (Eds.). 2004. Model Selection and Multimodel Inference. Springer New York. doi:10.1007/b97636
  11. [11] Qi Chen, Bing Xue, Lin Shang, and Mengjie Zhang. 2016. Improving Generalisation of Genetic Programming for Symbolic Regression with Structural Risk Minimisation. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16). ACM, 709–716. doi:10.1145/2908812.2908842
  12. [12] Bradley Efron. 2004. The Estimation of Prediction Error. J. Amer. Statist. Assoc. 99, 467 (2004), 619–632. doi:10.1198/016214504000000692
  13. [13] Jerome H. Friedman, Eric Grosse, and Werner Stuetzle. 1983. Multidimensional Additive Spline Approximation. SIAM J. Sci. Statist. Comput. 4, 2 (June 1983), 291–301. doi:10.1137/0904023
  14. [14] Peter D. Grünwald. 2007. The Minimum Description Length Principle. The MIT Press. doi:10.7551/mitpress/4643.001.0001
  15. [15] Trevor Hastie, Robert Tibshirani, Jerome Friedman, et al. 2009. The Elements of Statistical Learning. Springer Series in Statistics, New York
  16. [16] Pierre-Alexandre Kamienny, Guillaume Lample, Sylvain Lamprier, and Marco Virgolin. 2023. Deep generative symbolic regression with Monte-Carlo-tree-search. In International Conference on Machine Learning. PMLR, 15655–15668
  17. [17] Lukas Kammerer, Gabriel Kronberger, Bogdan Burlacu, Stephan M Winkler, Michael Kommenda, and Michael Affenzeller. 2020. Symbolic regression by exhaustive search: Reducing the search space using syntactical constraints and efficient semantic structure deduplication. In Genetic Programming Theory and Practice XVII. Springer, 79–99
  18. [18] John R. Koza. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press
  19. [19] Gabriel Kronberger, Bogdan Burlacu, Michael Kommenda, Stephan M. Winkler, and Michael Affenzeller. 2024. Symbolic Regression. Chapman and Hall/CRC. doi:10.1201/9781315166407
  20. [20]–[21] Mikel Landajuela, Chak Shing Lee, Jiachen Yang, Ruben Glatt, Claudio P Santiago, Ignacio Aravena, Terrell Mundhenk, Garrett Mulcahy, and Brenden K Petersen. 2022. A unified framework for deep symbolic regression. Advances in Neural Information Processing Systems 35 (2022), 33985–33998
  22. [22] William B. Langdon and Riccardo Poli. 2002. Foundations of Genetic Programming. Springer-Verlag. doi:10.1007/978-3-662-04726-2
  23. [23] Aron I. Maciel, Ivan G. Costa, and Ana C. Lorena. 2016. Measuring the complexity of regression problems. Proceedings of the International Joint Conference on Neural Networks 2016-October (2016), 1450–1457. doi:10.1109/IJCNN.2016.7727369
  24. [24] Georg Martius and Christoph H Lampert. 2016. Extrapolation and learning equations. arXiv preprint arXiv:1610.02995 (2016)
  25. [25] Trent McConaghy. 2011. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX. Springer, 235–260
  26. [26] José L. Montaña, César L. Alonso, Cruz Enrique Borges, and Javier de la Dehesa. 2011. Penalty Functions for Genetic Programming Algorithms. Springer Berlin Heidelberg, 550–562. doi:10.1007/978-3-642-21928-3_40
  27. [27] Andrea Murari, Riccardo Rossi, Luca Spolladore, Michele Lungaroni, Pasquale Gaudio, and Michela Gelfusa. 2023. A practical utility-based but objective approach to model selection for regression in scientific applications. Artificial Intelligence Review 56, S2 (Oct. 2023), 2825–2859. doi:10.1007/s10462-023-10591-4
  28. [28] Miguel Nicolau and Alexandros Agapitos. 2021. Choosing function sets with better generalisation performance for symbolic regression models. Genetic Programming and Evolvable Machines 22, 1 (2021), 73–100
  29. [29] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, 1 (11 Dec 2017), 36. doi:10.1186/s13040-017-0154-4
  30. [30] Michael O’Neill and Conor Ryan. 2004. Grammatical Evolution by Grammatical Evolution: The Evolution of Grammar and Genetic Code. Springer Berlin Heidelberg, 138–149. doi:10.1007/978-3-540-24650-3_13
  31. [31] Fitria Wulandari Ramlan, Gabriel Kronberger, Colm O’Riordan, and James McDermott. 2026. Comparative Analysis of Model Selection Criteria for Symbolic Regression using Genetic Programming. In Computational Intelligence, Proc. of IJCCI 2025, CCIS 2828. Springer Nature, 1–18. doi:10.1007/978-3-032-15635-8_6 (accepted for publication)
  32. [32] Jorma Rissanen. 1978. Modeling by shortest data description. Automatica 14, 5 (1978), 465–471. doi:10.1016/0005-1098(78)90005-5
  33. [33] Subham Sahoo, Christoph Lampert, and Georg Martius. 2018. Learning Equations for Extrapolation and Control. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 4442–4450. https://proceedings.mlr.press/v80/sahoo18a.html
  34. [34] Rafal Salustowicz and Jürgen Schmidhuber. 1997. Probabilistic Incremental Program Evolution. Evolutionary Computation 5, 2 (June 1997), 123–141. doi:10.1162/evco.1997.5.2.123
  35. [35] Gideon Schwarz. 1978. Estimating the Dimension of a Model. The Annals of Statistics 6, 2 (1978), 461–464. http://www.jstor.org/stable/2958889
  36. [36] Erik-Jan Senn. 2025. Model Recovery in Symbolic Regression: Theory, Conjectures, and Open Questions. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’25 Companion). ACM, 2556–2562. doi:10.1145/3712255.3734334
  37. [37]–[38] Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, and Chandan Reddy. 2023. Transformer-based planning for symbolic regression. Advances in Neural Information Processing Systems 36 (2023), 45907–45919
  39. [39] Guido F. Smits and Mark Kotanchek. 2005. Pareto-Front Exploitation in Symbolic Regression. Springer US, Boston, MA, 283–299. doi:10.1007/0-387-23254-0_17
  40. [40] Fangzheng Sun, Yang Liu, Jian-Xun Wang, and Hao Sun. 2023. Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ZTK3SefE8_Z
  41. [41] Alexander Topchy and W. F. Punch. 2001. Faster genetic programming based on local gradient search of numeric leaf values. In Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation (San Francisco, California) (GECCO ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 155–162
  42. [42] Ekaterina J. Vladislavleva, Guido F. Smits, and Dick den Hertog. 2009. Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming. IEEE Transactions on Evolutionary Computation 13, 2 (April 2009), 333–349. doi:10.1109/tevc.2008.926486
  43. [43] Yilong Xu, Yang Liu, and Hao Sun. 2024. Reinforcement symbolic regression machine. In The Twelfth International Conference on Learning Representations