AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow
Pith reviewed 2026-05-10 16:36 UTC · model grok-4.3
The pith
LLM agents can autonomously build high-quality deep learning surrogates for subsurface flow from natural language instructions alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoSurrogate is an LLM-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short.
What carries the argument
A multi-agent LLM framework with four specialized agents that handle data profiling, architecture selection, Bayesian hyperparameter optimization, training, and quality assessment, plus autonomous mechanisms for recovering from training instabilities or accuracy shortfalls.
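The recovery behaviour described above can be sketched as a small control loop. This is an illustrative sketch only, not the paper's implementation: the names `build_surrogate`, `BuildResult`, and `toy_train`, the rule of shrinking the learning rate tenfold after an unstable run, and the use of NaN to signal instability are all assumptions made for the example. The LLM agents, data profiling, and Bayesian hyperparameter search are deliberately left out.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: train(architecture, learning_rate) returns a
# validation error, with NaN standing in for a numerically unstable run.
Trainer = Callable[[str, float], float]


@dataclass
class BuildResult:
    architecture: Optional[str] = None   # best stable candidate seen so far
    error: float = math.inf              # validation error of that candidate
    met_threshold: bool = False          # True only if error <= threshold
    log: List[Tuple[str, float, float]] = field(default_factory=list)


def build_surrogate(model_zoo: List[str], train: Trainer, threshold: float,
                    lr: float = 1e-2, max_restarts: int = 2) -> BuildResult:
    result = BuildResult()
    for arch in model_zoo:                     # switch architecture on shortfall
        current_lr = lr
        for _ in range(max_restarts + 1):      # restart on instability
            err = train(arch, current_lr)
            result.log.append((arch, current_lr, err))
            if math.isnan(err):
                current_lr *= 0.1              # assumed adjustment rule
                continue
            if err < result.error:
                result.architecture, result.error = arch, err
            break                              # stable run: stop restarting
        if result.error <= threshold:          # quality assessment
            result.met_threshold = True
            return result
    return result                              # best effort, threshold not met


# Toy stand-in for real training (hypothetical numbers, not paper results).
def toy_train(arch: str, lr: float) -> float:
    if arch == "unet" and lr > 5e-3:
        return float("nan")                    # simulated divergence
    return {"unet": 0.08, "fno": 0.02}[arch]


result = build_surrogate(["unet", "fno"], toy_train, threshold=0.05)
print(result.architecture, result.error, len(result.log))
```

In this toy run the first architecture diverges once, trains stably after the restart but misses the threshold, and the loop then switches to the second architecture. A loop of this shape also makes the falsification test concrete: a run that returns with `met_threshold` still false has exhausted its autonomous retries.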
If this is right
- Domain scientists can obtain deployment-ready surrogates by providing only simulation data and a single natural-language instruction.
- The resulting surrogates outperform both expert-designed baselines and domain-agnostic AutoML methods on subsurface flow tasks.
- Common training failures such as numerical instabilities are resolved without human input by restarting with adjusted settings or switching architectures.
- Minimal human intervention is required at intermediate stages once the initial instruction is given.
Where Pith is reading between the lines
- The same agent structure could be tested on surrogate construction for other physics-based simulations, such as computational fluid dynamics or reservoir simulation.
- Integration with existing numerical simulators might allow end-to-end automation from data generation through surrogate deployment.
- Longer-term use could reveal whether the agents improve over repeated tasks by retaining problem-specific patterns across different geological settings.
- Scaling experiments on larger model zoos or higher-dimensional flow problems would test how far the autonomous recovery mechanisms extend.
Load-bearing premise
LLM agents can reliably and autonomously execute the full pipeline (architecture selection, Bayesian optimization, training, failure recovery, and quality assessment) without ML expertise or human intervention at any stage.
What would settle it
Applying the system to a fresh subsurface flow dataset and observing that it cannot reach the user-specified accuracy threshold after several autonomous retries or that it requires human intervention to succeed.
Original abstract
High-fidelity numerical simulation of subsurface flow is computationally intensive, especially for many-query tasks such as uncertainty quantification and data assimilation. Deep learning (DL) surrogates can significantly accelerate forward simulations, yet constructing them requires substantial machine learning (ML) expertise - from architecture design to hyperparameter tuning - that most domain scientists do not possess. Furthermore, the process is predominantly manual and relies heavily on heuristic choices. This expertise gap remains a key barrier to the broader adoption of DL surrogate techniques. For this reason, we present AutoSurrogate, a large-language-model-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short of targets. In our setting, a single natural-language sentence can be sufficient to produce a deployment-ready surrogate model, with minimum human intervention required at any intermediate stage. We demonstrate the utility of AutoSurrogate on a 3D geological carbon storage modeling task, mapping permeability fields to pressure and CO$_2$ saturation fields over 31 timesteps. Without any manual tuning, AutoSurrogate is able to outperform expert-designed baselines and domain-agnostic AutoML methods, demonstrating strong potential for practical deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AutoSurrogate, an LLM-driven multi-agent framework that automates construction of deep learning surrogate models for subsurface flow problems. Given simulation data and a natural-language instruction, four specialized agents perform data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, training, quality assessment against user thresholds, and autonomous recovery from failures such as numerical instabilities or insufficient accuracy. The central claim is that a single natural-language sentence suffices to produce a deployment-ready surrogate that, on a 3D geological carbon-storage task mapping permeability to pressure and CO2 saturation over 31 timesteps, outperforms both expert-designed baselines and domain-agnostic AutoML methods without any manual tuning or ML expertise.
Significance. If the autonomy and outperformance claims hold under transparent reporting, the work would meaningfully lower the barrier for domain scientists to deploy DL surrogates in geoscience applications such as uncertainty quantification and history matching. The multi-agent orchestration of the full pipeline (including failure recovery) represents a practical advance over existing AutoML tools, but its significance is currently limited by the absence of quantitative metrics, reproducibility details, and ablation evidence in the manuscript.
major comments (3)
- [Abstract and §5] Abstract and §5 (Experiments): the claim that AutoSurrogate 'outperforms expert-designed baselines and domain-agnostic AutoML methods' is stated without any numerical results (e.g., relative L2 errors on pressure or saturation fields, wall-clock times, or success rates), baseline specifications, or statistical comparisons. This absence makes the headline empirical result impossible to assess and is load-bearing for the central contribution.
- [§3] §3 (Methodology): the four agents' decision logic, exact prompts, and rules for architecture switching or training restarts are described only at a high level. Without these, it is impossible to verify the claimed autonomy or to reproduce the pipeline, undermining the assertion that 'minimum human intervention' is required.
- [§5] §5 (Experiments): no ablation is reported that compares LLM-driven choices (architecture selection, hyperparameter proposals, failure recovery) against random or default AutoML selections, nor are success-rate statistics (fraction of runs requiring restarts or switches) provided. These omissions prevent separation of the multi-agent framework's contribution from possible prompt engineering or task-specific defaults.
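The metrics requested in the major comments can be stated precisely in a few lines. The sketch below is a minimal illustration, not the authors' evaluation code: `relative_l2_error` computes the usual ratio of the error norm to the reference norm over a flattened field, and `summarize_runs` reports the success rate and the fraction of runs that needed restarts or architecture switches. The function names, the record keys (`met_threshold`, `restarts`, `switches`), and all numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def relative_l2_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Return ||pred - true||_2 / ||true||_2 for a flattened field."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    if den == 0.0:
        raise ValueError("reference field has zero norm")
    return num / den


def summarize_runs(runs: List[Dict[str, int]]) -> Dict[str, float]:
    """Aggregate per-run records into the fractions the referee asks for."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": sum(bool(r["met_threshold"]) for r in runs) / n,
        "restart_fraction": sum(r["restarts"] > 0 for r in runs) / n,
        "switch_fraction": sum(r["switches"] > 0 for r in runs) / n,
    }


# Made-up values for illustration only; these are not results from the paper.
err = relative_l2_error(pred=[3.0, 5.0], true=[3.0, 4.0])
stats = summarize_runs([
    {"met_threshold": True, "restarts": 0, "switches": 0},
    {"met_threshold": True, "restarts": 2, "switches": 1},
    {"met_threshold": False, "restarts": 1, "switches": 0},
    {"met_threshold": True, "restarts": 0, "switches": 0},
])
print(err, stats)
```

Reporting these quantities separately for the pressure and saturation fields, and across repeated runs, would let a reader check the outperformance claim rather than take it on trust.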
minor comments (2)
- [§3] The model zoo composition and the precise quality-assessment thresholds used by the final agent should be listed explicitly for reproducibility.
- [§5] Figure captions in the experimental section lack detail on axis scales, error bars, and what each panel represents (e.g., pressure vs. saturation fields).
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We believe the suggested revisions will significantly improve the clarity, reproducibility, and empirical support of our work. Below we provide point-by-point responses to the major comments.
- Referee: [Abstract and §5] Abstract and §5 (Experiments): the claim that AutoSurrogate 'outperforms expert-designed baselines and domain-agnostic AutoML methods' is stated without any numerical results (e.g., relative L2 errors on pressure or saturation fields, wall-clock times, or success rates), baseline specifications, or statistical comparisons. This absence makes the headline empirical result impossible to assess and is load-bearing for the central contribution.
  Authors: We fully agree that the central empirical claim requires concrete numerical backing to be properly evaluated. In the revised manuscript, we will augment both the abstract and §5 with specific quantitative results, including relative L2 errors for the pressure and CO2 saturation fields, wall-clock times for surrogate construction and inference, success rates across multiple runs, detailed specifications of the expert-designed baselines and AutoML methods, and appropriate statistical comparisons. These additions will make the outperformance claim verifiable.
  Revision: yes
- Referee: [§3] §3 (Methodology): the four agents' decision logic, exact prompts, and rules for architecture switching or training restarts are described only at a high level. Without these, it is impossible to verify the claimed autonomy or to reproduce the pipeline, undermining the assertion that 'minimum human intervention' is required.
  Authors: We recognize that the high-level description in §3 limits reproducibility. We will revise §3 to provide more detailed explanations of each agent's decision logic. Additionally, we will include the exact prompts used by the agents in a new appendix, along with explicit rules governing architecture selection, switching criteria, and training restart procedures. This will allow independent verification of the autonomy claims.
  Revision: yes
- Referee: [§5] §5 (Experiments): no ablation is reported that compares LLM-driven choices (architecture selection, hyperparameter proposals, failure recovery) against random or default AutoML selections, nor are success-rate statistics (fraction of runs requiring restarts or switches) provided. These omissions prevent separation of the multi-agent framework's contribution from possible prompt engineering or task-specific defaults.
  Authors: We agree that ablations are necessary to attribute performance gains specifically to the multi-agent LLM framework. In the revised §5, we will incorporate an ablation study that contrasts the LLM-driven decisions with random or default AutoML baselines. We will also report success-rate statistics detailing the fraction of runs that required restarts or architecture switches. These additions will help distinguish the framework's contributions from other factors.
  Revision: yes
Circularity Check
No circularity; empirical claims rest on external test cases independent of inputs.
The manuscript presents an LLM multi-agent framework for surrogate construction and reports empirical outperformance on a 3D carbon-storage benchmark against expert baselines and AutoML methods. No mathematical derivation chain, fitted parameters, or self-referential definitions appear in the provided text. The central claim is supported by a reported experiment rather than reducing by construction to the framework's own inputs or prior self-citations. The absence of disclosed prompts or ablation metrics concerns reproducibility but does not constitute circularity under the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs possess sufficient reasoning and tool-use capabilities to manage data profiling, architecture selection, hyperparameter optimization, training, and autonomous recovery from instabilities without human intervention.