arxiv: 2604.11945 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI · cs.MA

AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

Jiale Liu, Nanzhe Wang

Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.MA

keywords LLM multi-agent framework · deep learning surrogates · subsurface flow modeling · Bayesian hyperparameter optimization · autonomous surrogate construction · carbon storage simulation · failure recovery

The pith

LLM agents can autonomously build high-quality deep learning surrogates for subsurface flow from natural language instructions alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a multi-agent system powered by large language models can take over the entire process of creating deep learning surrogate models for expensive subsurface flow simulations. Domain scientists supply simulation data and optional preferences in plain language, after which agents profile the data, pick an architecture from a model collection, run Bayesian hyperparameter optimization, train the model, and check results against accuracy thresholds. This matters because high-fidelity numerical simulations are too slow for many-query tasks like uncertainty quantification, yet building faster surrogates has required scarce machine learning expertise. The agents also fix problems on their own, such as restarting training after instabilities or trying a different architecture if accuracy falls short. On a three-dimensional geological carbon storage example that maps permeability fields to pressure and saturation fields across 31 time steps, the system produced a ready-to-use surrogate that beat both expert-designed models and general AutoML tools with no manual adjustments.

Core claim

AutoSurrogate is an LLM-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short.

What carries the argument

A multi-agent LLM framework with four specialized agents that handle data profiling, architecture selection, Bayesian hyperparameter optimization, training, and quality assessment, plus autonomous mechanisms for recovering from training instabilities or accuracy shortfalls.
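
The control flow described above can be pictured as a short loop. What follows is only a sketch under stated assumptions: `SharedContext`, `profile_data`, `select_candidates`, `tune_and_train`, the canned scores, and the 0.95 threshold are all hypothetical and are not taken from the paper. The real agents are LLM-driven; deterministic stubs stand in for them here, and the architecture names are borrowed from the figure captions purely as labels.

```python
from dataclasses import dataclass, field

# Hypothetical zoo with made-up validation scores (not results from the paper).
STUB_SCORES = {"FNO3D": 0.91, "UFNO3D": 0.97, "RecurrentRUNet3D": 0.94}


@dataclass
class SharedContext:
    """Stand-in for the shared memory that the agents read and write."""
    instruction: str
    r2_threshold: float
    log: list = field(default_factory=list)


def profile_data(samples):
    """Data-profiling stub: record basic shape facts about the dataset."""
    return {"n_samples": len(samples), "n_timesteps": len(samples[0])}


def select_candidates(profile, zoo):
    """Architecture-selection stub: a fixed ordering replaces LLM reasoning."""
    return sorted(zoo)  # alphabetical order, purely for determinism


def tune_and_train(arch):
    """HPO + training stub: returns a canned validation R^2."""
    return STUB_SCORES[arch]


def build_surrogate(samples, ctx):
    """Try candidates in order until one passes the quality gate."""
    profile = profile_data(samples)
    ctx.log.append(("profile", profile))
    for arch in select_candidates(profile, STUB_SCORES):
        r2 = tune_and_train(arch)
        ctx.log.append(("trained", arch, r2))
        if r2 >= ctx.r2_threshold:        # quality assessment passed
            return arch, r2
        ctx.log.append(("switch", arch))  # accuracy shortfall: try the next one
    return None, None                     # no candidate met the threshold


ctx = SharedContext("Build a surrogate for CO2 saturation.", r2_threshold=0.95)
samples = [[0.0] * 31 for _ in range(4)]  # 4 dummy samples, 31 time steps
best_arch, best_r2 = build_surrogate(samples, ctx)
```

With these stub scores the loop rejects two candidates before accepting the third, which is the "switch architecture on accuracy shortfall" behaviour in miniature. Whether the actual system orders or prunes candidates this way is not stated in the text above.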

If this is right

Domain scientists can obtain deployment-ready surrogates by providing only simulation data and a single natural-language instruction.
The resulting surrogates outperform both expert-designed baselines and domain-agnostic AutoML methods on subsurface flow tasks.
Common training failures such as numerical instabilities are resolved without human input by restarting with adjusted settings or switching architectures.
Once the initial instruction is given, little or no human intervention is needed at any intermediate stage.
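
The restart-on-instability behaviour in the list above could look roughly like the following. This is a toy illustration, not the paper's mechanism: `train_once`, the stability limit `max_stable_lr`, the shrink factor, and the restart budget are all invented, and a NaN loss is used as the only instability signal.

```python
import math


def train_once(learning_rate, max_stable_lr=5e-3):
    """Toy training run: the loss becomes NaN when the step size is too large.

    `max_stable_lr` is an invented limit used only to trigger the failure.
    """
    if learning_rate > max_stable_lr:
        return float("nan")
    return 0.05  # canned final loss for a stable run


def train_with_recovery(learning_rate, max_restarts=3, shrink=0.1):
    """Restart with a smaller learning rate whenever the loss is not finite."""
    history = []
    for attempt in range(max_restarts + 1):
        loss = train_once(learning_rate)
        history.append((attempt, learning_rate, loss))
        if math.isfinite(loss):
            return loss, history
        learning_rate *= shrink  # adjusted configuration for the next attempt
    raise RuntimeError("training did not stabilise within the restart budget")


final_loss, history = train_with_recovery(learning_rate=1e-1)
```

Bounding the number of restarts matters in a sketch like this: once the budget is exhausted, control would presumably return to an outer step that can switch architectures instead of retrying indefinitely.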

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent structure could be tested on surrogate construction for other physics-based simulations such as fluid dynamics or reservoir management.
Integration with existing numerical simulators might allow end-to-end automation from data generation through surrogate deployment.
Longer-term use could reveal whether the agents improve over repeated tasks by retaining problem-specific patterns across different geological settings.
Scaling experiments on larger model zoos or higher-dimensional flow problems would test how far the autonomous recovery mechanisms extend.

Load-bearing premise

The assumption that LLM agents can reliably and autonomously execute the full pipeline including architecture selection, Bayesian optimization, training, failure recovery, and quality assessment without ML expertise or human intervention at any stage.

What would settle it

Applying the system to a fresh subsurface flow dataset and observing that it cannot reach the user-specified accuracy threshold after several autonomous retries or that it requires human intervention to succeed.

Figures

Figures reproduced from arXiv: 2604.11945 by Jiale Liu, Nanzhe Wang.

**Figure 1.** Three approaches to constructing deep learning surrogates for subsurface flow. Approach 1: manual, expert-driven pipeline relying on trial-and-error design and tuning. Approach 2: domain-agnostic AutoML, which performs brute-force search without exploiting physical priors. Approach 3: the proposed AutoSurrogate, an LLM-driven multi-agent framework that takes a dataset and a natural-language instruction and…

**Figure 2.** Overall architecture of the AutoSurrogate framework. Given simulation data and a natural-language instruction, four specialized agents collaborate through a shared memory context to autonomously produce deployment-ready surrogate models. The HPO & Training Agent incorporates a closed-loop self-correction mechanism that handles training instabilities and suboptimal convergence through continuation, stabilit…

**Figure 3.** Schematic of the geological model for CO2 storage, showing the full domain and the central storage aquifer. Following the setup in Han et al. (2025), the geological properties for the surrounding region, the caprock, and the basement are set to be homogeneous, as presented in…

**Figure 4.** Performance–efficiency Pareto frontier for pressure and CO2 saturation prediction. The step line connects Pareto-optimal baseline configurations. AutoSurrogate@1 (star) and @3 extend the frontier beyond all baselines. AutoML methods are fast but achieve substantially lower $R^2$, especially on saturation where they fall below hand-tuned baselines. UDeepONet3D and FNO3D (saturation) are omitted for clarity. m…

**Figure 5.** Search efficiency analysis. (a) Search overhead comparison: time spent on HPO trials (and, for AutoSurrogate, data profiling + LLM reasoning). For saturation, AutoSurrogate’s LLM-guided search is faster than all AutoML methods because it focuses trials on two pre-selected architectures. (b) LLM call profile for a representative pipeline run (26 calls, 7.2 min total). Each bar represents one LLM invocation,…

**Figure 6.** CO2 saturation predictions and absolute errors for Samples 822 and 851 at the 30th year. Each sample uses two rows, and the rightmost column shows the shared colorbars. (Extracted plot labels: Sample 820; RecurrentRUNet3D, UFNO3D, CNNTransformer3D, UDeepONet3D, AutoSurrogate, Ground Truth; Pressure (bar); |Prs. Error| (bar).)

**Figure 7.** Pressure predictions and absolute errors for Samples 820 and 815 at the 30th year. Each sample uses two rows, and the rightmost column shows the shared colorbars. Liu and Wang: Preprint submitted to Elsevier.
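
The Figure 4 caption refers to a performance–efficiency Pareto frontier. For readers unfamiliar with the idea, a minimal sketch of how such a frontier can be extracted from a set of configurations follows. It assumes the two axes are a cost to be minimised (hours here) and $R^2$ to be maximised; the paper's exact efficiency measure is not given in the caption, and every name and number below is invented for illustration.

```python
def pareto_front(configs):
    """Return configurations not dominated in (lower cost, higher R^2).

    `configs` is a list of (name, cost_hours, r2) tuples. A configuration is
    dominated if another one is at least as cheap and at least as accurate,
    and strictly better on one of the two axes.
    """
    front = []
    for name, cost, r2 in configs:
        dominated = any(
            (c <= cost and s >= r2) and (c < cost or s > r2)
            for _, c, s in configs
        )
        if not dominated:
            front.append((name, cost, r2))
    return sorted(front, key=lambda item: item[1])  # order by cost for plotting


# Invented numbers for illustration only; they are not results from the paper.
configs = [
    ("baseline_a", 1.0, 0.90),
    ("baseline_b", 2.0, 0.93),
    ("baseline_c", 3.0, 0.92),  # dominated by baseline_b
    ("automl_x", 0.5, 0.80),
    ("candidate", 1.5, 0.96),   # dominates baseline_b and baseline_c
]
front = pareto_front(configs)
```

A method "extends the frontier" in this sense when adding it to the pool removes previously non-dominated baselines or adds a point no baseline dominates, as `candidate` does in the toy data.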

Original abstract

High-fidelity numerical simulation of subsurface flow is computationally intensive, especially for many-query tasks such as uncertainty quantification and data assimilation. Deep learning (DL) surrogates can significantly accelerate forward simulations, yet constructing them requires substantial machine learning (ML) expertise - from architecture design to hyperparameter tuning - that most domain scientists do not possess. Furthermore, the process is predominantly manual and relies heavily on heuristic choices. This expertise gap remains a key barrier to the broader adoption of DL surrogate techniques. For this reason, we present AutoSurrogate, a large-language-model-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short of targets. In our setting, a single natural-language sentence can be sufficient to produce a deployment-ready surrogate model, with minimum human intervention required at any intermediate stage. We demonstrate the utility of AutoSurrogate on a 3D geological carbon storage modeling task, mapping permeability fields to pressure and CO$_2$ saturation fields over 31 timesteps. Without any manual tuning, AutoSurrogate is able to outperform expert-designed baselines and domain-agnostic AutoML methods, demonstrating strong potential for practical deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AutoSurrogate describes a multi-agent LLM system for automating DL surrogate construction in subsurface flow but supplies no metrics, prompts, or success rates to support its outperformance claims.

The main thing to know is that this paper presents AutoSurrogate as a four-agent LLM framework that takes a natural-language instruction and handles data profiling, architecture choice from a model zoo, Bayesian optimization, training, quality checks, and autonomous recovery from failures like instabilities for subsurface flow surrogates. The central claim is that it beats expert baselines and generic AutoML on a 3D carbon-storage task with no manual tuning, but the manuscript gives almost no numbers or implementation details to evaluate that.

Referee Report

3 major / 2 minor

Summary. The paper presents AutoSurrogate, an LLM-driven multi-agent framework that automates construction of deep learning surrogate models for subsurface flow problems. Given simulation data and a natural-language instruction, four specialized agents perform data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, training, quality assessment against user thresholds, and autonomous recovery from failures such as numerical instabilities or insufficient accuracy. The central claim is that a single natural-language sentence suffices to produce a deployment-ready surrogate that, on a 3D geological carbon-storage task mapping permeability to pressure and CO2 saturation over 31 timesteps, outperforms both expert-designed baselines and domain-agnostic AutoML methods without any manual tuning or ML expertise.

Significance. If the autonomy and outperformance claims hold under transparent reporting, the work would meaningfully lower the barrier for domain scientists to deploy DL surrogates in geoscience applications such as uncertainty quantification and history matching. The multi-agent orchestration of the full pipeline (including failure recovery) represents a practical advance over existing AutoML tools, but its significance is currently limited by the absence of quantitative metrics, reproducibility details, and ablation evidence in the manuscript.

major comments (3)

[Abstract and §5] Abstract and §5 (Experiments): the claim that AutoSurrogate 'outperforms expert-designed baselines and domain-agnostic AutoML methods' is stated without any numerical results (e.g., relative L2 errors on pressure or saturation fields, wall-clock times, or success rates), baseline specifications, or statistical comparisons. This absence makes the headline empirical result impossible to assess and is load-bearing for the central contribution.
[§3] §3 (Methodology): the four agents' decision logic, exact prompts, and rules for architecture switching or training restarts are described only at a high level. Without these, it is impossible to verify the claimed autonomy or to reproduce the pipeline, undermining the assertion that 'minimum human intervention' is required.
[§5] §5 (Experiments): no ablation is reported that compares LLM-driven choices (architecture selection, hyperparameter proposals, failure recovery) against random or default AutoML selections, nor are success-rate statistics (fraction of runs requiring restarts or switches) provided. These omissions prevent separation of the multi-agent framework's contribution from possible prompt engineering or task-specific defaults.

minor comments (2)

[§3] The model zoo composition and the precise quality-assessment thresholds used by the final agent should be listed explicitly for reproducibility.
[§5] Figure captions in the experimental section lack detail on axis scales, error bars, and what each panel represents (e.g., pressure vs. saturation fields).
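
The first major comment asks for relative L2 errors, and the Figure 4 caption reports $R^2$. For reference, both metrics can be computed as sketched below, using their standard definitions on flattened field values. The paper's exact averaging over samples and time steps is not specified in the text above, so this is an assumption, and the toy numbers are invented.

```python
import math


def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 over flattened field values."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den


def r2_score(pred, true):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, true))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


# Toy flattened "pressure field" values in bar; invented for illustration.
true = [210.0, 220.0, 230.0, 240.0]
pred = [212.0, 219.0, 229.0, 242.0]
err = relative_l2_error(pred, true)
r2 = r2_score(pred, true)
```

The two metrics answer different questions: relative L2 error normalises by the magnitude of the true field, while $R^2$ normalises by its variance, so a field with a large constant offset (such as pressure in bar) can show a small relative error even when $R^2$ is modest.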

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We believe the suggested revisions will significantly improve the clarity, reproducibility, and empirical support of our work. Below we provide point-by-point responses to the major comments.

Referee: [Abstract and §5] Abstract and §5 (Experiments): the claim that AutoSurrogate 'outperforms expert-designed baselines and domain-agnostic AutoML methods' is stated without any numerical results (e.g., relative L2 errors on pressure or saturation fields, wall-clock times, or success rates), baseline specifications, or statistical comparisons. This absence makes the headline empirical result impossible to assess and is load-bearing for the central contribution.

Authors: We fully agree that the central empirical claim requires concrete numerical backing to be properly evaluated. In the revised manuscript, we will augment both the abstract and §5 with specific quantitative results, including relative L2 errors for the pressure and CO2 saturation fields, wall-clock times for surrogate construction and inference, success rates across multiple runs, detailed specifications of the expert-designed baselines and AutoML methods, and appropriate statistical comparisons. These additions will make the outperformance claim verifiable. revision: yes
Referee: [§3] §3 (Methodology): the four agents' decision logic, exact prompts, and rules for architecture switching or training restarts are described only at a high level. Without these, it is impossible to verify the claimed autonomy or to reproduce the pipeline, undermining the assertion that 'minimum human intervention' is required.

Authors: We recognize that the high-level description in §3 limits reproducibility. We will revise §3 to provide more detailed explanations of each agent's decision logic. Additionally, we will include the exact prompts used by the agents in a new appendix, along with explicit rules governing architecture selection, switching criteria, and training restart procedures. This will allow independent verification of the autonomy claims. revision: yes
Referee: [§5] §5 (Experiments): no ablation is reported that compares LLM-driven choices (architecture selection, hyperparameter proposals, failure recovery) against random or default AutoML selections, nor are success-rate statistics (fraction of runs requiring restarts or switches) provided. These omissions prevent separation of the multi-agent framework's contribution from possible prompt engineering or task-specific defaults.

Authors: We agree that ablations are necessary to attribute performance gains specifically to the multi-agent LLM framework. In the revised §5, we will incorporate an ablation study that contrasts the LLM-driven decisions with random or default AutoML baselines. We will also report success-rate statistics detailing the fraction of runs that required restarts or architecture switches. These additions will help distinguish the framework's contributions from other factors. revision: yes
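
The success-rate statistics requested by the referee and promised in the rebuttal amount to simple fractions over repeated runs. A minimal sketch, assuming each run is logged with a restart count, an architecture-switch count, and a pass/fail flag; this record layout and the four example runs are hypothetical, not data from the paper.

```python
# Hypothetical run records; fields and values are invented for illustration.
runs = [
    {"restarts": 0, "switches": 0, "passed": True},
    {"restarts": 2, "switches": 0, "passed": True},
    {"restarts": 0, "switches": 1, "passed": True},
    {"restarts": 1, "switches": 1, "passed": False},
]


def summarize(runs):
    """Fraction of runs that passed, needed a restart, or needed a switch."""
    n = len(runs)
    return {
        "success_rate": sum(r["passed"] for r in runs) / n,
        "restart_rate": sum(r["restarts"] > 0 for r in runs) / n,
        "switch_rate": sum(r["switches"] > 0 for r in runs) / n,
    }


stats = summarize(runs)
```

Reporting these rates alongside accuracy would show how often the recovery paths are actually exercised, which is the information the referee says is needed to separate the framework's contribution from task-specific defaults.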

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external test cases independent of inputs.

full rationale

The manuscript presents an LLM multi-agent framework for surrogate construction and reports empirical outperformance on a 3D carbon-storage benchmark against expert baselines and AutoML methods. No mathematical derivation chain, fitted parameters, or self-referential definitions appear in the provided text. The central claim is supported by a reported experiment rather than reducing by construction to the framework's own inputs or prior self-citations. The absence of disclosed prompts or ablation metrics concerns reproducibility but does not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the premise that current LLMs can serve as reliable autonomous agents for complex ML engineering tasks in a scientific domain.

axioms (1)

domain assumption: LLMs possess sufficient reasoning and tool-use capabilities to manage data profiling, architecture selection, hyperparameter optimization, training, and autonomous recovery from instabilities without human intervention.
Invoked throughout the framework description as the basis for full autonomy from a single natural-language instruction.
