Synthics: Synthetic Physics-like Datasets for Machine Learning

Jari Vepsäläinen

arxiv: 2606.06724 · v1 · pith:F5B5SJGYnew · submitted 2026-06-04 · 💻 cs.LG


Pith reviewed 2026-06-28 02:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords synthetic data · physics equations · Bayesian probabilistic context-free grammar · regression datasets · machine learning · hyperparameter tuning · equation corpus · Kolmogorov-Smirnov validation


The pith

A Bayesian probabilistic context-free grammar generates synthetic physics-like regression datasets that match real equations on all eight structural features and enable equivalent hyperparameter tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to produce artificial datasets whose equations and input ranges resemble those found in physics. It trains a Bayesian version of a probabilistic context-free grammar on an existing corpus of equations so that newly sampled equations keep the same algebraic patterns. Input values are then drawn from physically plausible sub-ranges that also respect variable constraints. This matters for machine learning because many physics problems lack enough real measurements to train or tune models reliably. When the synthetic data is used to tune a gradient-boosted regressor, the selected settings perform as well on real data as settings chosen directly from real data.
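The grammar step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy grammar, the rule counts, and the symmetric Dirichlet (add-alpha) prior are all assumptions made for the example, since this page does not give the actual grammar or prior.

```python
import random
from collections import Counter

# Hypothetical toy grammar (not the paper's): nonterminal -> candidate right-hand sides.
GRAMMAR = {
    "E": [("E", "+", "T"), ("E", "*", "T"), ("T",)],
    "T": [("sqrt(", "E", ")"), ("x",), ("c",)],
}

def smoothed_probs(counts, alpha):
    """Posterior-mean rule probabilities under an assumed symmetric Dirichlet(alpha) prior."""
    probs = {}
    for lhs, rules in GRAMMAR.items():
        total = sum(counts[(lhs, r)] for r in rules) + alpha * len(rules)
        probs[lhs] = [(counts[(lhs, r)] + alpha) / total for r in rules]
    return probs

def sample(symbol, probs, rng, depth=0, max_depth=6):
    """Expand `symbol` recursively; past max_depth a nonterminal becomes 'x' so sampling ends."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token
    if depth >= max_depth:
        return "x"
    rule = rng.choices(GRAMMAR[symbol], weights=probs[symbol])[0]
    return "".join(sample(s, probs, rng, depth + 1, max_depth) for s in rule)

# Made-up rule counts standing in for a small corpus; two rules are never observed.
counts = Counter({
    ("E", ("E", "+", "T")): 3, ("E", ("T",)): 7,
    ("T", ("x",)): 8, ("T", ("c",)): 2,
})
unsmoothed = smoothed_probs(counts, alpha=0.0)  # unseen rules get probability 0
smoothed = smoothed_probs(counts, alpha=0.5)    # unseen rules stay reachable
equation = sample("E", smoothed, random.Random(0))
```

With `alpha=0.0` the multiplication and square-root rules can never be sampled because they were not counted, which is one plausible reading of why an unsmoothed grammar fitted to a small corpus loses structural fidelity.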

Core claim

The central claim is that a Bayesian probabilistic context-free grammar fitted to the Feynman equation corpus can sample novel equations whose eight measured structural properties match the corpus distribution, whereas an unsmoothed probabilistic grammar matches only two of them. When these synthetic equations are paired with inputs drawn from recovered applicability domains, the resulting regression datasets allow a gradient-boosted regressor to select, on average, the sixth-best hyperparameter configuration out of twenty when tested on real data, matching the result obtained by tuning directly on real data and beating both random expression trees and pure noise.
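The "sixth-best out of twenty" figure is a rank of the selected configuration on real data. A rough sketch of such a metric follows; the function name and the error values are hypothetical, and the paper's exact protocol (for example how ranks are averaged across tasks) is not described on this page.

```python
def rank_of_selected(proxy_errors, real_errors):
    """1-based rank on real data of the configuration with the lowest proxy error.

    Both lists are indexed by configuration; lower error is better.
    Ties are broken by configuration index.
    """
    chosen = min(range(len(proxy_errors)), key=proxy_errors.__getitem__)
    order = sorted(range(len(real_errors)), key=real_errors.__getitem__)
    return order.index(chosen) + 1

# Illustrative numbers only: four configurations scored on three data sources.
real = [0.30, 0.10, 0.20, 0.40]
synthetic = [0.25, 0.22, 0.15, 0.50]
noise = [0.90, 0.80, 0.70, 0.10]

synthetic_rank = rank_of_selected(synthetic, real)  # picks config 2, second best on real data
noise_rank = rank_of_selected(noise, real)          # picks config 3, worst on real data
```

Averaging this rank over many tasks would give a single number per data source, comparable in spirit to the 6th, 10th, and 19th places quoted above.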

What carries the argument

Bayesian Probabilistic Context-Free Grammar that encodes the algebraic structure of the equation corpus through a prior and is used both to sample new equations and to characterise their applicability domains.

If this is right

The Bayesian prior is required to achieve structural fidelity when the training corpus is limited in size.
Synthetic datasets produced this way can substitute for real data in hyperparameter tuning without loss of downstream performance.
Non-intrusive probing recovers both the applicability domain and inter-variable constraints for each generated equation.
The generated data substantially outperforms both random expression trees and unstructured noise in the same tuning task.
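Non-intrusive probing treats the equation as a black box and only looks at whether evaluation succeeds. The sketch below is an assumption-laden illustration for a single variable: the grid-based probe, the function name `probe_domain`, and the choice to return the hull of all valid points are not taken from the paper, and inter-variable constraints are not covered.

```python
import math

def probe_domain(f, lo, hi, n=201):
    """Return (min, max) of the probe points where black-box f gives a finite real value.

    Returns None when no probe point is valid. Assumes the valid region is one interval.
    """
    valid = []
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        try:
            y = f(x)
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # evaluation failed, so x lies outside the domain
        if isinstance(y, (int, float)) and math.isfinite(y):
            valid.append(x)
    return (valid[0], valid[-1]) if valid else None

# Example: sqrt(1 - x^2) is only real-valued for |x| <= 1.
domain = probe_domain(lambda x: math.sqrt(1 - x * x), -2.0, 2.0)
```

The resolution of the recovered bounds is limited by the grid spacing, so a real pipeline would likely refine the edges or sample the probing range more densely.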

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grammar-based generation pipeline could be applied to equation corpora from chemistry or biology to create synthetic data for those fields.
Further validation on additional model families such as neural networks or support-vector machines would test whether the equivalence generalises beyond gradient boosting.
The recovered applicability domains might be reused to generate data that respects physical units or conservation laws not explicitly present in the grammar.
One could measure whether equations sampled from the grammar obey additional invariants such as dimensional consistency that were not part of the original eight structural features.

Load-bearing premise

That agreement on eight structural features plus matching performance in one hyperparameter-tuning task is enough to guarantee the synthetic data will be useful for other machine learning models and other physics domains.

What would settle it

Repeat the hyperparameter-tuning experiment with a neural-network regressor instead of gradient boosting and check whether the synthetic-data tuning still selects configurations whose test performance on real data matches that of real-data tuning.

Figures

Figures reproduced from arXiv: 2606.06724 by Jari Vepsäläinen.

**Figure 2.** An example equation expression tree and the corresponding Bayesian-smoothed production rules. Each rule’s [...]

**Figure 3.** Applicability-domain specification for one variable. The probing range is sampled, each point evaluated, and [...]

**Figure 4.** Input sampling within a variable’s applicability domain. A sub-range [...]

**Figure 5.** Average configuration rankings across tasks for each source. The real-data ranking is the baseline, and [...]

**Figure 6.** The regret of each source, measured as the average [...]

Original abstract

Representative data is fundamental in machine learning, as limited data hinders generalisation. Collecting sufficient real-world samples is often infeasible. Synthetic data generation offers a practical solution, but only if the generated data faithfully reflects the structure of real observations. In this paper, a method for generating synthetic regression datasets that structurally resemble physics equations from a given equation corpus is presented. The approach uses a Bayesian Probabilistic Context-Free Grammar to capture the underlying algebraic structure of the corpus, from which novel equations are sampled. To ensure the generated inputs lie within a physically meaningful domain, the applicability domain is characterised for each equation through non-intrusive probing, also recovering inter-variable constraints. Input sampling further mimics realistic experimental conditions by drawing from random sub-ranges of the valid domain with mixed uniform and truncated normal distributions. The generated data is statistically validated against the Feynman equation corpus using Kolmogorov-Smirnov tests. The generated equations match the corpus on all of the eight studied structural features, compared to only two for an unsmoothed purely probabilistic grammar, demonstrating that the Bayesian prior is essential for structural fidelity given the size of the corpus. In a downstream hyperparameter-tuning task, a gradient-boosted regressor tuned on the synthetic data picks, on average, the 6th-best configuration out of 20 on real data, matching the result of tuning on real data itself and substantially outperforming random expression trees (10th) and noise (19th).
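The input-sampling step described in the abstract can be illustrated as follows. This is only a sketch under stated assumptions: the way the sub-range is drawn, the 50/50 mixing probability `p_uniform`, and the choice of a normal centred on the sub-range with the bounds at two standard deviations are all invented for the example.

```python
import random

def sample_inputs(lo, hi, n, rng, p_uniform=0.5):
    """Draw n inputs from a random sub-range of the valid domain [lo, hi].

    With probability p_uniform the sub-range is sampled uniformly; otherwise a
    normal centred on the sub-range is truncated to it by rejection sampling.
    """
    a, b = sorted(rng.uniform(lo, hi) for _ in range(2))  # random sub-range
    if rng.random() < p_uniform:
        return [rng.uniform(a, b) for _ in range(n)]
    mu, sigma = (a + b) / 2, (b - a) / 4  # assumed: bounds sit at +/- 2 sigma
    out = []
    while len(out) < n:
        x = rng.gauss(mu, sigma)
        if a <= x <= b:  # reject draws outside the sub-range
            out.append(x)
    return out

rng = random.Random(0)
mixed = sample_inputs(0.0, 10.0, 100, rng)
normal_only = sample_inputs(0.0, 10.0, 50, rng, p_uniform=0.0)
```

Rejection sampling keeps the example dependency-free; with the bounds at two standard deviations roughly 95% of draws are accepted.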

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Bayesian PCFG plus domain probing produces equations that match the Feynman corpus on eight structural features and substitute for real data in one hyperparameter-tuning task, but the evidence stays narrow.


The core contribution is a pipeline that samples new regression equations from a Bayesian-smoothed PCFG trained on the Feynman corpus, then uses non-intrusive probing to recover valid input domains and inter-variable constraints, and finally draws inputs from mixed uniform and truncated-normal sub-ranges. The generated equations pass Kolmogorov-Smirnov tests on all eight structural features while an unsmoothed grammar only matches two. In the downstream check, a gradient-boosted regressor tuned on the synthetic data reaches the same average rank (6th of 20) on real data as tuning directly on real data, beating random trees and noise.

The method is practical and the validation is external: it compares against the real corpus and a real-data task rather than fitting to itself. The Bayesian prior clearly helps structural fidelity on a small corpus, and the domain-probing step addresses a real issue with synthetic physics data.

The soft spot is the narrowness of the tests. Everything rests on those eight features plus equivalence in one specific task with one model family. No results appear for symbolic regression, PINNs, other regressors, or different equation sets, so it is unclear whether the synthetic data transfers usefully beyond this setup. The free parameters in the Bayesian prior also sit unexamined.

This is useful for people building data-scarce scientific ML pipelines who need regression data that respects algebraic structure. A reader working on synthetic data for physics-informed models would get concrete implementation ideas and a working baseline. It is coherent on its own terms and shows honest engagement with the corpus and a real task, so it deserves a serious referee even though additional experiments on transfer would strengthen the case.

Referee Report

3 major / 2 minor

Summary. The paper introduces Synthics, a method that trains a Bayesian probabilistic context-free grammar on the Feynman equation corpus to sample novel equations, then generates regression datasets by probing applicability domains and sampling inputs from mixed uniform/truncated-normal distributions over sub-ranges. It reports that the synthetic equations match the corpus on all eight structural features via Kolmogorov-Smirnov tests (versus only two features for an unsmoothed baseline grammar) and that hyperparameter tuning of a gradient-boosted regressor on the synthetic data yields the same average rank (6th of 20) on real data as tuning directly on real data.

Significance. If the structural-matching and downstream-equivalence results hold under fuller verification, the work supplies a concrete route to large-scale synthetic regression data that preserves algebraic structure from a physics corpus, which could mitigate data scarcity in scientific ML. The explicit demonstration that the Bayesian prior is required for fidelity (via the baseline comparison) and the reproducible downstream task are strengths that could be cited by follow-on studies.

major comments (3)

§5.2 (downstream experiment): equivalence is shown only for hyperparameter tuning of gradient-boosted regressors on a single real dataset; this single-task, single-model result is load-bearing for the claim that synthetic data has 'utility matching real data in downstream ML use,' yet no transfer experiments to other tasks (e.g., symbolic regression) or model families are reported.
§4.1–4.2 (structural validation): the eight features and the exact KS-test procedure (including sample sizes, multiple-testing correction, and power analysis) are not detailed enough to confirm that 'match on all eight' is robust rather than an artifact of feature selection or low statistical power; this directly supports the central claim that the Bayesian prior is 'essential for structural fidelity.'
§3.3 (applicability-domain probing): the non-intrusive probing and inter-variable constraint recovery are validated only indirectly through the same narrow downstream result; no direct ablation or sensitivity check on the probing step is provided, leaving open whether domain characterization errors propagate into the reported equivalence.
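To make the KS-test concern concrete, here is a small self-contained sketch of a two-sample Kolmogorov-Smirnov check on one structural feature. It uses the standard large-sample critical value rather than an exact p-value, and the function names and sample data are hypothetical; the paper's actual procedure and sample sizes are not given on this page.

```python
import math
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )

def ks_matches(xs, ys, alpha=0.05):
    """True when the large-sample KS test fails to reject equality at level alpha."""
    n, m = len(xs), len(ys)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(xs, ys) <= critical

# Made-up feature values (e.g. operator counts) for a corpus and two generators.
corpus = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6] * 5
similar = [1, 2, 2, 3, 3, 4, 4, 4, 5, 6] * 5
shifted = [v + 4 for v in corpus]

same = ks_matches(corpus, similar)       # small gap between the CDFs
different = ks_matches(corpus, shifted)  # large gap, equality rejected
```

Two points relevant to the referee's comment follow from this form: a "match" is a failure to reject, so small samples make matching easier, and a Bonferroni-style correction across eight features would amount to calling `ks_matches(..., alpha=0.05 / 8)`, which makes matching easier still.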

minor comments (2)

The definition of the eight structural features should be moved from the appendix to the main text or given an explicit table for reader accessibility.
Notation for the Bayesian prior hyperparameters (free parameters listed in the axiom ledger) is introduced without a consolidated table; a single reference table would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to improve clarity and robustness without altering the core claims or results.


Referee: §5.2 (downstream experiment): equivalence is shown only for hyperparameter tuning of gradient-boosted regressors on a single real dataset; this single-task, single-model result is load-bearing for the claim that synthetic data has 'utility matching real data in downstream ML use,' yet no transfer experiments to other tasks (e.g., symbolic regression) or model families are reported.

Authors: The downstream experiment in §5.2 is explicitly framed as a targeted demonstration of utility for hyperparameter tuning of gradient-boosted regressors, using the average rank (6th of 20) on real data as the metric. This matches the performance of tuning directly on real data and outperforms the baselines (random trees at 10th, noise at 19th). While we agree the result is narrow, it directly supports the paper's claim of 'utility matching real data in downstream ML use' within the regression setting studied. We will revise the discussion in §5.2 and the conclusion to explicitly qualify the scope as an initial validation on this task and model family, and note that broader transfer (e.g., to symbolic regression) remains future work. No new experiments are added at this stage.

revision: partial

Referee: §4.1–4.2 (structural validation): the eight features and the exact KS-test procedure (including sample sizes, multiple-testing correction, and power analysis) are not detailed enough to confirm that 'match on all eight' is robust rather than an artifact of feature selection or low statistical power; this directly supports the central claim that the Bayesian prior is 'essential for structural fidelity.'

Authors: Section 4.1 defines the eight structural features (number of operators, constants, variables, etc.) drawn from the Feynman corpus analysis. Section 4.2 reports KS tests showing matches on all eight for the Bayesian PCFG versus only two for the unsmoothed baseline. We will expand §4.2 to include: (i) explicit sample sizes used for the KS tests (number of sampled equations), (ii) confirmation that no multiple-testing correction was applied because the features are treated as separate univariate comparisons, and (iii) a brief power discussion based on the observed effect sizes. These additions will be made from the existing experimental logs without changing the reported outcomes.

revision: yes

Referee: §3.3 (applicability-domain probing): the non-intrusive probing and inter-variable constraint recovery are validated only indirectly through the same narrow downstream result; no direct ablation or sensitivity check on the probing step is provided, leaving open whether domain characterization errors propagate into the reported equivalence.

Authors: The applicability-domain probing in §3.3 is a core methodological step that enables physically meaningful input sampling; its correctness is supported by the subsequent structural fidelity (all eight features matched) and the downstream equivalence. We acknowledge the validation is indirect. We will add a short paragraph in §3.3 and §4 discussing the probing's role and why direct ablation was not performed (computational cost of repeated domain characterization), while noting that errors would have manifested in KS mismatches or degraded downstream ranks, neither of which was observed. This clarifies rather than expands the experiments.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected


The paper's core derivation fits a Bayesian PCFG to the Feynman corpus to sample new equations, then validates structural fidelity via KS tests on eight features (showing superiority to an unsmoothed baseline) and utility via a downstream hyperparameter-tuning task on real data. These steps rely on external corpus statistics, baseline comparison, and real-data performance rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. The Bayesian prior's role is demonstrated empirically against the baseline, not assumed by construction. No ansatzes, uniqueness theorems, or renamings of known results appear as load-bearing elements. The chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on learning a prior from the given corpus and assumes the PCFG form captures relevant algebraic structure; no new physical entities are postulated.

free parameters (1)

Bayesian prior hyperparameters
Parameters controlling the strength of the Bayesian smoothing on the PCFG are required to achieve the reported structural fidelity and are implicitly fitted or selected from the corpus.
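
One common way such a smoothing-strength parameter enters a grammar is through a symmetric Dirichlet prior over the rules of each nonterminal. The expression below is offered only as an illustration of that standard form; the symbols $\alpha$ (prior strength), $c$ (rule count) and $K_A$ (number of rules for nonterminal $A$) are assumptions, and the paper may parameterise its prior differently.

```latex
\hat{\theta}_{A \to \beta}
  = \frac{c(A \to \beta) + \alpha}{\sum_{\beta'} c(A \to \beta') + \alpha K_A},
\qquad
\text{e.g. } c(A \to \beta) = 0,\ \sum_{\beta'} c(A \to \beta') = 10,\ K_A = 3,\ \alpha = 0.5
\;\Rightarrow\;
\hat{\theta}_{A \to \beta} = \frac{0.5}{11.5} \approx 0.043 .
```

With $\alpha = 0$ the same unseen rule would receive probability exactly zero, so the value chosen for $\alpha$ directly controls how much probability mass flows to structures absent from the corpus.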

axioms (1)

domain assumption: A probabilistic context-free grammar can adequately capture the algebraic structure of equations in the Feynman corpus.
This underpins the sampling of novel equations and is invoked throughout the method description in the abstract.



Reference graph

Works this paper leans on

32 extracted references · 2 canonical work pages

[1] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, 2025.

[2] Laith Alzubaidi, Jinshuai Bai, Aiman Al-Sabaawi, Jose Santamaría, A. S. Albahri, Bashar Sami Nayyef Al-dabbagh, Mohammed A. Fadhel, Mohamed Manoufali, Jinglan Zhang, Ali H. Al-Timemy, Ye Duan, Amjed Abdullah, Laith Farhan, Yi Lu, Ashish Gupta, Felix Albu, Amin Abbosh, and Yuantong Gu. A survey on deep learning tools dealing with data scarcity: definition... 2023.

[3] Mandeep Goyal and Qusay H. Mahmoud. A systematic review of synthetic data generation techniques using generative AI. Electronics, 13(17):3509, 2024.

[4] Mohammad Ahmed Basri and Helen Chen. How useful is synthetic data in developing predictive models for health? Studies in Health Technology and Informatics, 2025.

[5] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 969–977, 2018.

[6] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30. IEEE, 2017.

[7] Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, and Bernie Wang. Mitra: Mixed synthetic priors for enhancing tabular foundation models. arXiv preprint, 2025.

[8] George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3:422–440, 2021.

[9] John R. Rice. The algorithm selection problem. In Advances in Computers, volume 15, pages 65–118. Elsevier, 1976.

[10] Kate A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection. ACM Computing Surveys, 41(1):6:1–6:25, 2008. doi: 10.1145/1456650.1456656

[11] Joaquin Vanschoren. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.

[12] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014. doi: 10.1145/2641190.2641198

[13] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, volume 28, 2015.

[14] Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1945–1954. PMLR, 2017.

[15] Jure Brence, Ljupčo Todorovski, and Sašo Džeroski. Probabilistic grammars for equation discovery. Knowledge-Based Systems, 224:107077, 2021.

[16] Sebastian Mežnar, Sašo Džeroski, and Ljupčo Todorovski. Efficient generator of mathematical expressions for symbolic regression. Machine Learning, 112(11):4563–4596, 2023.

[17] Dimitrios Angelis, Filippos Sofos, and Theodoros E. Karakasidis. Artificial intelligence in physical sciences: Symbolic regression trends and perspectives. Archives of Computational Methods in Engineering, 30(6):3845–3865, 2023.

[18] Nour Makke and Sanjay Chawla. Interpretable scientific discovery with symbolic regression: A review. Artificial Intelligence Review, 57(1):2, 2024.

[19] Silviu-Marian Udrescu and Max Tegmark. AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16):eaay2631, 2020.

[20] Luca Biggio, Tommaso Bendinelli, Alexander Neitz, Aurelien Lucchi, and Giambattista Parascandolo. Neural symbolic regression that scales. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 936–945. PMLR, 2021.

[21] Pierre-Alexandre Kamienny, Stéphane d’Ascoli, Guillaume Lample, and François Charton. End-to-end symbolic regression with transformers. In Advances in Neural Information Processing Systems, volume 35, pages 10269–10281, 2022.

[22] Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, and Chandan K. Reddy. Transformer-based planning for symbolic regression. In Advances in Neural Information Processing Systems, volume 36, 2023.

[23] Shun Sato and Issei Sato. Can test-time computation mitigate reproduction bias in neural symbolic regression? arXiv preprint arXiv:2505.22081, 2025.

[24] Wei Chen and Mark Fuge. Active expansion sampling for learning feasible domains in an unbounded input space. Structural and Multidisciplinary Optimization, 57:925–945, 2017.

[25] Nirupaplava Metta, Rohit Ramachandran, and Marianthi Ierapetritou. A novel adaptive sampling based methodology for feasible region identification of compute intensive models using artificial neural network. AIChE Journal, 67(2), 2021.

[26] Julian Straus, Jabir Ali Ouassou, Brage Rugstad Knudsen, and Rahul Anantharaman. Constrained adaptive sampling for domain reduction in surrogate model generation: Applications to hydrogen production. AIChE Journal, 67(7), 2021.

[27] Jinhao Liang, Yixuan Sun, Anirban Samaddar, Sandeep Madireddy, and Ferdinando Fioretto. Chance-constrained flow matching for high-fidelity constraint-aware generation. arXiv preprint arXiv:2509.25157, 2025.

[28] Richard P Feynman, Robert B Leighton, Matthew Sands, and Everett M Hafner. The Feynman lectures on physics; vol. I. American Journal of Physics, 33(9):750–752, 1965.

[29] Silviu-Marian Udrescu and Max Tegmark. Feynman symbolic regression database. https://space.mit.edu/home/tegmark/aifeynman.html, 2020. Accessed: 2026-05-18.

[30] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2623–2631. ACM, 2019.

[31] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, pages 2546–2554, 2011.

[32] Peter Emerson. The original Borda count and partial voting. Social Choice and Welfare, 40(2):353–358, 2013.

A PREPRINT - JUNE 8, 2026

Appendix Figure A1: Validation of the standard PCFG. Per-feature distribution comparison between the Feynman corpus (blue) and 1000 generated equations (orange); the final panel shows operator frequencies. Two of the eight s...

[1] [1]

Accurate predictions on small data with a tabular foundation model.Nature, 637 (8045):319–326, 2025

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637 (8045):319–326, 2025

2025

[2] [2]

Laith Alzubaidi, Jinshuai Bai, Aiman Al-Sabaawi, Jose Santamaría, A. S. Albahri, Bashar Sami Nayyef Al- dabbagh, Mohammed A. Fadhel, Mohamed Manoufali, Jinglan Zhang, Ali H. Al-Timemy, Ye Duan, Amjed Abdullah, Laith Farhan, Yi Lu, Ashish Gupta, Felix Albu, Amin Abbosh, and Yuantong Gu. A survey on deep learning tools dealing with data scarcity: definition...

2023

[3] [3]

Mandeep Goyal and Qusay H. Mahmoud. A systematic review of synthetic data generation techniques using generative AI.Electronics, 13(17):3509, 2024

2024

[4] [4]

How useful is synthetic data in developing predictive models for health?Studies in Health Technology and Informatics, 2025

Mohammad Ahmed Basri and Helen Chen. How useful is synthetic data in developing predictive models for health?Studies in Health Technology and Informatics, 2025

2025

[5] [5]

Training deep networks with synthetic data: Bridging the reality gap by domain randomization

Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 969–977, 2018

2018

[6] [6]

Domain randomiza- tion for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomiza- tion for transferring deep neural networks from simulation to the real world. InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23–30. IEEE, 2017

2017

[7] [7]

Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W

Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, and Bernie Wang. Mitra: Mixed synthetic priors for enhancing tabular foundation models. arXiv preprint, 2025

2025

[8] [8]

Physics- informed machine learning.Nature Reviews Physics, 3:422–440, 2021

George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics- informed machine learning.Nature Reviews Physics, 3:422–440, 2021

2021

[9] [9]

John R. Rice. The algorithm selection problem. InAdvances in Computers, volume 15, pages 65–118. Elsevier, 1976

1976

[10] [10]

Smith-Miles

Kate A. Smith-Miles. Cross-disciplinary perspectives on meta-learning for algorithm selection.ACM Computing Surveys, 41(1):6:1–6:25, 2008. doi: 10.1145/1456650.1456656

work page doi:10.1145/1456650.1456656 2008

[11] [11]

Meta-learning: A survey.arXiv preprint arXiv:1810.03548, 2018

Joaquin Vanschoren. Meta-learning: A survey.arXiv preprint arXiv:1810.03548, 2018

Pith/arXiv arXiv 2018

[12] [12]

N., Bischl, B., and Torgo, L

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014. doi: 10.1145/2641190.2641198

work page doi:10.1145/2641190.2641198 2014

[13] [13]

Efficient and robust automated machine learning

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. InAdvances in Neural Information Processing Systems, volume 28, 2015. 13 APREPRINT- JUNE8, 2026

2015

[14] [14]

Kusner, Brooks Paige, and José Miguel Hernández-Lobato

Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1945–1954. PMLR, 2017

1945

[15] [15]

Probabilistic grammars for equation discovery.Knowledge- Based Systems, 224:107077, 2021

Jure Brence, Ljupˇco Todorovski, and Sašo Džeroski. Probabilistic grammars for equation discovery.Knowledge- Based Systems, 224:107077, 2021

2021

[16] [16]

Efficient generator of mathematical expressions for symbolic regression.Machine Learning, 112(11):4563–4596, 2023

Sebastian Mežnar, Sašo Džeroski, and Ljupˇco Todorovski. Efficient generator of mathematical expressions for symbolic regression.Machine Learning, 112(11):4563–4596, 2023

2023

[17] Dimitrios Angelis, Filippos Sofos, and Theodoros E. Karakasidis. Artificial intelligence in physical sciences: Symbolic regression trends and perspectives. Archives of Computational Methods in Engineering, 30(6):3845–3865, 2023.

[18] Nour Makke and Sanjay Chawla. Interpretable scientific discovery with symbolic regression: A review. Artificial Intelligence Review, 57(1):2, 2024.

[19] Silviu-Marian Udrescu and Max Tegmark. AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16):eaay2631, 2020.

[20] Luca Biggio, Tommaso Bendinelli, Alexander Neitz, Aurelien Lucchi, and Giambattista Parascandolo. Neural symbolic regression that scales. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 936–945. PMLR, 2021.

[21] Pierre-Alexandre Kamienny, Stéphane d’Ascoli, Guillaume Lample, and François Charton. End-to-end symbolic regression with transformers. In Advances in Neural Information Processing Systems, volume 35, pages 10269–10281, 2022.

[22] Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, and Chandan K. Reddy. Transformer-based planning for symbolic regression. In Advances in Neural Information Processing Systems, volume 36, 2023.

[23] Shun Sato and Issei Sato. Can test-time computation mitigate reproduction bias in neural symbolic regression? arXiv preprint arXiv:2505.22081, 2025.

[24] Wei Chen and Mark Fuge. Active expansion sampling for learning feasible domains in an unbounded input space. Structural and Multidisciplinary Optimization, 57:925–945, 2017.

[25] Nirupaplava Metta, Rohit Ramachandran, and Marianthi Ierapetritou. A novel adaptive sampling based methodology for feasible region identification of compute intensive models using artificial neural network. AIChE Journal, 67(2), 2021.

[26] Julian Straus, Jabir Ali Ouassou, Brage Rugstad Knudsen, and Rahul Anantharaman. Constrained adaptive sampling for domain reduction in surrogate model generation: Applications to hydrogen production. AIChE Journal, 67(7), 2021.

[27] Jinhao Liang, Yixuan Sun, Anirban Samaddar, Sandeep Madireddy, and Ferdinando Fioretto. Chance-constrained flow matching for high-fidelity constraint-aware generation. arXiv preprint arXiv:2509.25157, 2025.

[28] Richard P Feynman, Robert B Leighton, Matthew Sands, and Everett M Hafner. The Feynman lectures on physics; vol. I. American Journal of Physics, 33(9):750–752, 1965.

[29] Silviu-Marian Udrescu and Max Tegmark. Feynman symbolic regression database. https://space.mit.edu/home/tegmark/aifeynman.html, 2020. Accessed: 2026-05-18.

[30] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2623–2631. ACM, 2019.

[31] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, pages 2546–2554, 2011.

[32] Peter Emerson. The original Borda count and partial voting. Social Choice and Welfare, 40(2):353–358, 2013.

Appendix

Figure A1: Validation of the standard PCFG. Per-feature distribution comparison between the Feynman corpus (blue) and 1000 generated equations (orange); the final panel shows operator frequencies. Two of the eight s...