arxiv: 2605.06839 · v1 · submitted 2026-05-07 · ❄️ cond-mat.mtrl-sci · cs.AI

LLM-Guided Open Hypothesis Learning from Autonomous Scanning Probe Microscopy Experiments

Boris Slautin, Utkarsh Pratiush, Yu Liu, Kamyar Barakati, Sergei Kalinin

Pith reviewed 2026-05-11 00:47 UTC · model grok-4.3

keywords: open hypothesis learning, symbolic regression, large language models, autonomous scanning probe microscopy, ferroelectric domain switching, PZT thin film, domain-wall motion

The pith

An open hypothesis-learning framework uses symbolic regression and LLM evaluation to evolve sparse microscopy data into interpretable physical laws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a workflow that begins with five seed measurements from autonomous piezoresponse force microscopy on a PZT thin film. Symbolic regression produces candidate analytical expressions relating voltage and time, while an LLM evaluator scores them for physical plausibility, scaling behavior, and consistency with known mechanisms. The process refines incomplete candidates into growth laws that match kinetic domain-wall motion. This shifts autonomous experimentation from optimizing within preset hypothesis spaces to letting models emerge directly from data.

Core claim

Starting from five seed measurements, the workflow evolves from physically incomplete candidate expressions toward interpretable voltage-time growth laws consistent with kinetic domain-wall motion.

What carries the argument

Symbolic regression to generate candidate analytical relationships from data, combined with large-language-model evaluation that ranks candidates by physical plausibility and consistency with known mechanisms.
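
As a rough illustration of how a data-fit score and a plausibility score can be combined, the sketch below ranks three fixed candidate expressions. Everything in it is an assumption made for illustration: the data are synthetic, the three checks in `plausibility` are hand-written heuristics standing in for the language-model evaluator, and the names `true_law`, `CANDIDATES`, and `rank` are hypothetical. It is not the paper's implementation, which obtains candidates from symbolic regression and scores them with an LLM.

```python
import math

def true_law(V, t):
    # Assumed ground truth for this sketch only (natural logarithm assumed).
    return V * (0.0008 * math.log(t) + 0.0078)

# Five synthetic (voltage, pulse duration) points; not measurements from the paper.
POINTS = [(4.0, 0.01), (6.0, 0.1), (8.0, 1.0), (5.0, 0.5), (7.0, 0.05)]
DATA = [(V, t, true_law(V, t)) for V, t in POINTS]

# Hypothetical candidates with fixed coefficients, as a regression step might return.
CANDIDATES = {
    "0.0078*V": lambda V, t: 0.0078 * V,
    "0.004*log(t) + 0.05": lambda V, t: 0.004 * math.log(t) + 0.05,
    "V*(0.0008*log(t) + 0.0078)": true_law,
}

def mse(f):
    """Mean squared error of candidate f on the synthetic data."""
    return sum((f(V, t) - r) ** 2 for V, t, r in DATA) / len(DATA)

def plausibility(f):
    """Fraction of simple physical checks passed (a stand-in for the LLM)."""
    checks = [
        abs(f(8.0, 0.1) - f(4.0, 0.1)) > 1e-12,   # depends on voltage
        abs(f(6.0, 1.0) - f(6.0, 0.01)) > 1e-12,  # depends on pulse duration
        abs(f(0.0, 0.1)) < 1e-12,                 # no growth at zero bias
    ]
    return sum(checks) / len(checks)

def rank(candidates):
    """Sort by plausibility (descending), then by loss (ascending)."""
    scored = [(name, plausibility(f), mse(f)) for name, f in candidates.items()]
    return sorted(scored, key=lambda s: (-s[1], s[2]))

ranked = rank(CANDIDATES)
for name, p, loss in ranked:
    print(f"{name:30s} plausibility={p:.2f} mse={loss:.2e}")
```

Sorting by plausibility first means a candidate that ignores pulse duration cannot outrank one that passes every check, however well it fits. How the actual evaluator weighs fit against plausibility is not described in this summary.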

If this is right

Autonomous microscopy can generate new physical models rather than select measurements inside fixed objective spaces.
Candidate laws emerge from the experiment itself instead of being supplied in advance.
The same combination of symbolic regression and LLM evaluation can integrate into broader hierarchical autonomous scientific workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be tested on systems with multiple competing mechanisms to determine whether the LLM evaluator reliably favors the dominant scaling.
Reducing the number of seed measurements below five would require pairing the method with more efficient symbolic regression variants.
Extending the evaluator to include quantitative consistency checks against additional experimental modalities could strengthen the ranking step.

Load-bearing premise

The language model supplies reliable judgments of physical plausibility for mathematical expressions without systematic bias or error.

What would settle it

Apply the full workflow to a calibrated physical system whose correct voltage-time law is already established and check whether it selects that law over competing expressions.
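
A recovery test of this kind can be prototyped on synthetic data before touching an instrument. The sketch below generates noisy radii from a known law, fits a small hand-written library of two-parameter forms by least squares, and checks which form wins. The library, the noise level, the voltage and time grids, and the coefficients `TRUE_A` and `TRUE_B` are all assumptions chosen for illustration. Plain least squares stands in for symbolic regression, and no LLM ranking is involved.

```python
import math
import random

TRUE_A, TRUE_B = 0.0008, 0.0078  # assumed ground-truth coefficients for the test

# Hypothetical candidate library; each form is linear in its parameters (a, b).
LIBRARY = {
    "a*V + b":          lambda V, t: (V, 1.0),
    "a*log(t) + b":     lambda V, t: (math.log(t), 1.0),
    "a*V*t + b*V":      lambda V, t: (V * t, V),
    "V*(a*log(t) + b)": lambda V, t: (V * math.log(t), V),
}

def make_data(noise=1e-4, seed=0):
    """Synthetic radii from r = V*(a*ln t + b) plus seeded Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for V in (3.0, 4.5, 6.0, 7.5, 9.0):
        for t in (1e-3, 1e-2, 1e-1, 1.0, 10.0):
            r = V * (TRUE_A * math.log(t) + TRUE_B) + rng.gauss(0.0, noise)
            data.append((V, t, r))
    return data

def fit(features, data):
    """Least-squares fit of r = a*f1 + b*f2 via the 2x2 normal equations."""
    rows = [(*features(V, t), r) for V, t, r in data]
    s11 = sum(f1 * f1 for f1, _, _ in rows)
    s12 = sum(f1 * f2 for f1, f2, _ in rows)
    s22 = sum(f2 * f2 for _, f2, _ in rows)
    y1 = sum(f1 * r for f1, _, r in rows)
    y2 = sum(f2 * r for _, f2, r in rows)
    det = s11 * s22 - s12 * s12
    a = (y1 * s22 - y2 * s12) / det
    b = (y2 * s11 - y1 * s12) / det
    mse = sum((a * f1 + b * f2 - r) ** 2 for f1, f2, r in rows) / len(rows)
    return a, b, mse

data = make_data()
fits = {name: fit(feat, data) for name, feat in LIBRARY.items()}
best = min(fits, key=lambda name: fits[name][2])  # purely data-driven choice
a_hat, b_hat, _ = fits[best]
print(best, a_hat, b_hat)
```

Passing this check only shows that the fitting step can recover a law it was given a chance to find. Whether the evaluator prefers the correct law over a competing expression with similar loss is the part that needs a calibrated physical system.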

Figures

Figures reproduced from arXiv: 2605.06839 by Boris Slautin, Kamyar Barakati, Sergei Kalinin, Utkarsh Pratiush, Yu Liu.

**Figure 1.** (a) Schematic illustration of hypothesis-space expansion. In conventional Bayesian optimization, measurements refine predictions within a fixed search space. In hypothesis learning, symbolic regression and LLM-based evaluation are used to expand the set of plausible functional forms considered during the experiment. (b) Schematic of the open hypothesis learning module. The workflow proceeds as follows (Fig…

**Figure 2.** Autonomous hypothesis-learning workflow for local ferroelectric domain switching. 𝑉 and 𝑡 denote the switching-pulse parameters: voltage amplitude and pulse duration, respectively.

III.2 Hypothesis Learning Setup. Symbolic regression was performed using the PySR Python library to generate candidate analytical models describing domain growth. The search space was defined using a restricted set of operators, in…

**Figure 3.** Initial seed measurements and domain-radius extraction. (a) Randomly selected seed points in the voltage–pulse-duration space, colored by the extracted effective domain radius. (b) Representative PFM amplitude and phase images with corresponding binary masks used to identify the switched domains and extract the effective radius. The scan size is 700 × 700 nm.

Following random initialization, symbolic regre…

**Figure 4.** (a,b) Pareto front in the loss-complexity space of symbolic regression models at early (iteration 0) and late (iteration 9) stages. (c,d) Corresponding PySR “best” scores, illustrating purely data-driven prioritization of candidate expressions. (e) Predicted effective-radius maps in the voltage–pulse-duration space for the candidate equations at iterations 0 and 9, with experimentally measured points overl…
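
For readers unfamiliar with the loss-complexity Pareto front mentioned in the Figure 4 caption, the sketch below shows one way to compute such a front and a per-model score from a list of candidates. The model list and its numbers are invented for illustration and do not come from the paper. The score used here, the negative change in log-loss per unit of added complexity, follows our understanding of how PySR documents its score, and should be checked against the PySR documentation before being relied on.

```python
import math

# Invented candidates: complexity is a node count, loss is a mean squared error.
MODELS = [
    {"expr": "c", "complexity": 1, "loss": 4.0e-4},
    {"expr": "a*V", "complexity": 3, "loss": 9.0e-5},
    {"expr": "a*V + b", "complexity": 5, "loss": 1.1e-4},
    {"expr": "V*(a*log(t) + b)", "complexity": 8, "loss": 1.0e-6},
    {"expr": "V*(a*log(t) + b) + c*t", "complexity": 12, "loss": 1.2e-6},
]

def pareto_front(models):
    """Keep a model only if it has strictly lower loss than every simpler one."""
    front = []
    for m in sorted(models, key=lambda m: (m["complexity"], m["loss"])):
        if not front or m["loss"] < front[-1]["loss"]:
            front.append(m)
    return front

def scores(front):
    """Score = -(change in log loss) / (change in complexity) along the front.

    The simplest model has no predecessor and is given 0.0 in this sketch.
    """
    if not front:
        return []
    out = [0.0]
    for prev, cur in zip(front, front[1:]):
        d_log_loss = math.log(cur["loss"]) - math.log(prev["loss"])
        out.append(-d_log_loss / (cur["complexity"] - prev["complexity"]))
    return out

front = pareto_front(MODELS)
front_scores = scores(front)
best = max(zip(front_scores, front), key=lambda pair: pair[0])[1]
for s, m in zip(front_scores, front):
    print(f'{m["expr"]:26s} complexity={m["complexity"]:2d} score={s:.3f}')
```

A score of this kind rewards the largest drop in loss per added term, which is why it counts as purely data-driven: nothing in it knows whether the winning expression is physically sensible.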

Original abstract

Autonomous experimentation has transformed microscopy and materials discovery by enabling closed-loop optimization including imaging and spectroscopy tuning, strucutre property relationship discovery, and exploration of combinatorial libraries. However, most current workflows remain limited to selecting measurements within fixed objective or hypothesis spaces, rather than generating new physical models from experimental data. Here, we introduce an open hypothesis-learning framework that combines symbolic regression with large-language-model-based physical evaluation and implement it for autonomous scanning probe microscopy. Symbolic regression generates candidate analytical relationships directly from sparse measurements, while the language-model evaluator ranks these candidates according to physical plausibility, scaling behavior, and consistency with known mechanisms. We demonstrate the approach on autonomous piezoresponse force microscopy measurements of ferroelectric domain switching in a PZT thin film. Starting from five seed measurements, the workflow evolves from physically incomplete candidate expressions toward interpretable voltage-time growth laws consistent with kinetic domain-wall motion. This work extends autonomous microscopy from closed-loop optimization toward open hypothesis discovery, where candidate physical laws emerge from the experiment itself rather than being specified in advance. More broadly, the framework establishes a route for integrating symbolic regression, physical reasoning, and adaptive experimentation into hierarchical autonomous scientific workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a working autonomous SPM loop that chains symbolic regression with LLM ranking to generate interpretable voltage-time laws from five PZT seed measurements, but the LLM step has no benchmarks or external checks.

The main thing to know is that this work puts symbolic regression and an LLM together inside a real autonomous piezoresponse force microscopy experiment. Starting from five measurements on a PZT film, the system proposes candidate equations, has the language model score them for physical sense and scaling, and iterates until it lands on a growth law that matches expected kinetic domain-wall motion behavior.

Referee Report

3 major / 2 minor

Summary. The paper introduces an open hypothesis-learning framework for autonomous scanning probe microscopy that integrates symbolic regression to generate candidate analytical expressions from sparse experimental data with LLM-based evaluation of physical plausibility, scaling, and mechanism consistency. Demonstrated on piezoresponse force microscopy of ferroelectric domain switching in PZT thin films, the workflow starts from five seed measurements and iteratively evolves incomplete candidates into interpretable voltage-time growth laws asserted to be consistent with kinetic domain-wall motion, extending closed-loop optimization toward open hypothesis discovery.

Significance. If the central claims hold with proper validation, the work would be significant for advancing autonomous experimentation in materials science by enabling data-driven generation of physical models rather than optimization within predefined spaces. The combination of symbolic regression and LLM physical reasoning in a closed-loop SPM setup offers a novel route for hierarchical scientific workflows, with potential to accelerate discovery in ferroelectric and related systems.

major comments (3)

[Abstract and demonstration section] The central claim that the workflow 'evolves from physically incomplete candidate expressions toward interpretable voltage-time growth laws consistent with kinetic domain-wall motion' is unsupported by any quantitative metrics, error analysis, residual plots, or direct comparison to established domain-wall models (e.g., Merz's law or Kolmogorov-Avrami-Ishibashi kinetics); without these, consistency is asserted rather than demonstrated.
[LLM evaluator description (likely §3 or Methods)] The ranking of candidates by 'physical plausibility, scaling behavior, and consistency with known mechanisms' is load-bearing as the sole filter between symbolic regression outputs and final hypotheses, yet no benchmarks are reported (e.g., recovery rate of known analytic forms on synthetic data, inter-rater agreement with domain experts, or ablation studies replacing the LLM with physics-informed scoring); this leaves open the risk of LLM hallucinations or biases determining the outcome.
[PZT results (likely §4)] The evolution from five seed measurements to the final law is presented without details on how LLM rankings were validated against ground truth or alternative evaluators, and without any sensitivity analysis to the number of seeds or symbolic regression hyperparameters, undermining reproducibility and the claim of open hypothesis learning.
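
For reference, the two established models named in the first major comment are usually written in the textbook forms below. The symbols are standard notation rather than the paper's: $E$ is the applied field, $E_a$ an activation field, $\tau_0$ and $v_\infty$ prefactors, $P_s$ the spontaneous polarization, and $n$ the Avrami exponent. How these map onto the tip voltage and pulse duration of a PFM experiment is not specified in this summary.

```latex
\begin{align}
  % Merz's law: field dependence of the switching time
  \tau(E) &= \tau_0 \exp\!\left(\frac{E_a}{E}\right), \\
  % equivalent statement for the domain-wall velocity
  v(E) &= v_\infty \exp\!\left(-\frac{E_a}{E}\right), \\
  % Kolmogorov-Avrami-Ishibashi (KAI) switched-polarization fraction
  \frac{\Delta P(t)}{2 P_s} &= 1 - \exp\!\left[-\left(\frac{t}{\tau}\right)^{n}\right].
\end{align}
```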

minor comments (2)

[Abstract] Abstract contains a typo: 'strucutre property relationship' should read 'structure-property relationship'.
[Abstract] Notation for the final voltage-time laws is not explicitly defined with symbols or units in the summary description, making it harder to assess scaling behavior claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas where additional quantitative support, validation, and reproducibility details are needed to strengthen the claims. We have revised the manuscript to incorporate these elements while preserving the core framework and demonstration.

Referee: [Abstract and demonstration section] The central claim that the workflow 'evolves from physically incomplete candidate expressions toward interpretable voltage-time growth laws consistent with kinetic domain-wall motion' is unsupported by any quantitative metrics, error analysis, residual plots, or direct comparison to established domain-wall models (e.g., Merz's law or Kolmogorov-Avrami-Ishibashi kinetics); without these, consistency is asserted rather than demonstrated.

Authors: We agree that the original demonstration relied primarily on qualitative interpretation of the evolved expressions. In the revised manuscript, we have added quantitative metrics including mean squared errors and residual plots for the final growth laws, as well as direct comparisons of the extracted scaling exponents and functional forms against Merz's law and KAI kinetics on the same PZT dataset. These additions are now included in the demonstration section and supplementary information.
revision: yes

Referee: [LLM evaluator description (likely §3 or Methods)] The ranking of candidates by 'physical plausibility, scaling behavior, and consistency with known mechanisms' is load-bearing as the sole filter between symbolic regression outputs and final hypotheses, yet no benchmarks are reported (e.g., recovery rate of known analytic forms on synthetic data, inter-rater agreement with domain experts, or ablation studies replacing the LLM with physics-informed scoring); this leaves open the risk of LLM hallucinations or biases determining the outcome.

Authors: We acknowledge that benchmarks for the LLM evaluator were not provided in the original submission. The revised methods section now includes (i) recovery rates of known analytic forms (Merz and KAI) on synthetic ferroelectric switching data, (ii) inter-rater agreement statistics between the LLM and two domain experts on a held-out set of 50 candidate expressions, and (iii) an ablation comparing LLM ranking against a physics-informed scoring function based on scaling and mechanism priors. These results are reported with confidence intervals.
revision: yes

Referee: [PZT results (likely §4)] The evolution from five seed measurements to the final law is presented without details on how LLM rankings were validated against ground truth or alternative evaluators, and without any sensitivity analysis to the number of seeds or symbolic regression hyperparameters, undermining reproducibility and the claim of open hypothesis learning.

Authors: We agree that reproducibility details were insufficient. The revised results section now provides the full sequence of LLM rankings with scores, a comparison against an alternative evaluator (physics-informed heuristic), and sensitivity analyses varying the number of seed measurements (3–10) and symbolic regression hyperparameters (population size, mutation rate). These are documented in the main text and supplementary tables to support the open hypothesis learning claim.
revision: yes
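
One simple form the inter-rater agreement mentioned in the second response could take is a Spearman rank correlation between the evaluator's ranking and an expert's ranking of the same candidates. The sketch below is a generic illustration under that assumption. The rankings are invented, the function name `spearman_rho` is hypothetical, and the closed-form formula used applies only when neither ranking contains ties. The rebuttal does not say which statistic the authors would use.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("need two rankings of equal length >= 2")
    n = len(rank_a)
    # Sum of squared rank differences, item by item.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Invented rankings of five candidate expressions (1 = most plausible).
llm_ranks = [1, 2, 3, 4, 5]
expert_ranks = [1, 3, 2, 4, 5]
rho = spearman_rho(llm_ranks, expert_ranks)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the two rankings largely agree on order. With only a handful of candidates the statistic is coarse, so a held-out set of the size the response mentions (50 expressions) would be more informative.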

Circularity Check

0 steps flagged

No circularity: framework applies external symbolic regression and LLM ranking to experimental data

The paper presents a methodological workflow that generates candidate expressions via symbolic regression on sparse measurements and ranks them using an external LLM evaluator for physical plausibility and consistency with known mechanisms. No equations, derivations, or self-citations are shown that reduce the output hypotheses to the inputs by construction, such as fitting a parameter and relabeling it as a prediction or defining consistency solely via the same loop. The PZT demonstration applies the framework to evolve toward domain-wall motion laws, but this is an empirical application rather than a self-referential mathematical reduction. The central claim remains independent of any load-bearing self-citation or ansatz smuggling, qualifying as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the unverified assumption that an off-the-shelf LLM can perform reliable physical ranking of symbolic expressions and that symbolic regression will produce physically meaningful candidates from only five sparse measurements.

free parameters (1)

number of seed measurements
The workflow is initialized with five measurements as stated; this choice is not derived from first principles.

axioms (2)

domain assumption: LLMs can evaluate physical plausibility, scaling behavior, and consistency with known mechanisms for candidate expressions
Invoked in the ranking step of the framework description.
domain assumption: Symbolic regression applied to sparse SPM data will generate useful candidate analytical relationships
Foundational premise of the open hypothesis generation step.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Symbolic regression generates candidate analytical relationships directly from sparse measurements, while the language-model evaluator ranks these candidates according to physical plausibility, scaling behavior, and consistency with known mechanisms."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: "the LLM-based evaluator selected r=V(0.0008 log t + 0.0078), which has the same essential structure: voltage-assisted growth with logarithmic time dependence"
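
The scaling behavior of the quoted expression can be checked numerically. The sketch below evaluates r = V(0.0008 log t + 0.0078) under two assumptions that the passage does not settle: that `log` is the natural logarithm, and that V, t, and r are in whatever units the paper uses (none are given here). The function name `radius` and the sample values of V and t are arbitrary.

```python
import math

A, B = 0.0008, 0.0078  # coefficients quoted in the passage

def radius(V, t, a=A, b=B):
    """Evaluate r = V * (a * ln(t) + b); natural log is an assumption."""
    return V * (a * math.log(t) + b)

V, t = 5.0, 0.1                      # arbitrary illustrative inputs
r = radius(V, t)
doubled = radius(2 * V, t)           # linear in V: doubling V doubles r
per_decade = radius(V, 10 * t) - r   # each decade in t adds V * a * ln(10)
t_zero = math.exp(-B / A)            # pulse duration at which r crosses zero

print(f"r = {r:.6f}, r(2V) = {doubled:.6f}, gain per decade = {per_decade:.6f}")
```

Two properties follow directly from the form: the radius is strictly proportional to voltage at fixed pulse duration, and it grows by a fixed increment per decade of time. The expression also turns negative for pulse durations below `t_zero`, so it can only describe growth above that duration.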

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
