OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI
Pith reviewed 2026-05-21 06:44 UTC · model grok-4.3
The pith
OpenSeisML supplies curated public seismic volumes and well logs to train generative models that produce multiple subsurface realizations for uncertainty quantification in inversion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present OpenSeisML as an open large-scale dataset of real seismic volumes and well logs, assembled through an automated curation pipeline that performs time-to-depth conversion via checkshot interpolation, specifically to enable generative models that capture the statistical distribution of subsurface properties and thereby generate multiple realizations for uncertainty quantification in seismic inversion.
What carries the argument
The OpenSeisML dataset together with its automated curation pipeline that converts time-domain seismic to depth using checkshot interpolation to produce reproducible velocity models suitable for generative modeling.
Load-bearing premise
The selected UK public survey data, once processed through the automated time-to-depth conversion, sufficiently represent the statistical distribution of subsurface properties so that generative models trained on them can produce useful realizations for other regions or surveys.
What would settle it
Train a generative model on OpenSeisML and test whether the resulting realizations, when used as priors, measurably improve inversion accuracy or uncertainty calibration on a held-out seismic survey from a different geological setting.
Original abstract
The advent of machine learning (ML) and computer vision has significantly accelerated seismic inversion workflows by reducing the computational cost of traditionally expensive iterative methods. However, the development and evaluation of ML methods remain limited by the scarcity of realistic velocity models, as most high-quality data are privately owned by oil and gas companies. To address this gap, we present OpenSeisML, a collection of real seismic datasets designed to support generative AI (Gen-AI) workflows for seismic inversion. The datasets are curated from publicly available surveys in the UK National Data Repository (NDR). When seismic volumes are in the time domain and wells are in depth, a time-to-depth conversion is required. We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data. Here, we present an automated data curation pipeline that enables seismic data preparation while ensuring reproducibility. The objective is to train a generative model that captures the statistical distribution of subsurface properties, enabling the synthesis of multiple statistically consistent realizations for uncertainty quantification which can act as a prior for seismic inversion.
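The time-to-depth step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a single well with checkshot pairs of two-way time and depth, piecewise-linear interpolation of the time-depth curve, and samples held constant beyond the checkshot range. The function names (`interp_linear`, `interval_velocities`, `time_to_depth_trace`) and the sample values are hypothetical. Lateral interpolation of the velocity model between wells is omitted.

```python
from bisect import bisect_right


def interp_linear(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at x, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    w = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + w * (ys[i + 1] - ys[i])


def interval_velocities(cs_twt_s, cs_depth_m):
    """Interval velocity between successive checkshots (two-way time assumed)."""
    return [
        2.0 * (cs_depth_m[i + 1] - cs_depth_m[i]) / (cs_twt_s[i + 1] - cs_twt_s[i])
        for i in range(len(cs_twt_s) - 1)
    ]


def time_to_depth_trace(amplitudes, dt_s, cs_twt_s, cs_depth_m, dz_m, n_z):
    """Resample one time-domain trace onto a regular depth axis."""
    trace_times = [k * dt_s for k in range(len(amplitudes))]
    out = []
    for j in range(n_z):
        depth = j * dz_m
        # Invert the time-depth curve: find the two-way time for this depth.
        twt = interp_linear(depth, cs_depth_m, cs_twt_s)
        # Read the trace amplitude at that two-way time.
        out.append(interp_linear(twt, trace_times, amplitudes))
    return out


# Hypothetical checkshots: two-way time (s) against depth (m).
cs_twt = [0.0, 1.0, 2.0]
cs_depth = [0.0, 1000.0, 2500.0]

# Toy trace whose amplitude equals its own two-way time, sampled every 0.5 s.
trace = [0.0, 0.5, 1.0, 1.5, 2.0]

depth_trace = time_to_depth_trace(trace, 0.5, cs_twt, cs_depth, dz_m=250.0, n_z=11)
print(interval_velocities(cs_twt, cs_depth))
print(depth_trace)
```

Because the toy trace stores its own two-way time as amplitude, each output sample shows directly which time was mapped to each depth, which makes the conversion easy to inspect by eye.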
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents OpenSeisML, a collection of real seismic and well-log datasets curated from publicly available UK National Data Repository (NDR) surveys. It describes an automated data curation pipeline that performs time-to-depth conversion of post-stack seismic volumes by interpolating checkshot data to construct velocity models. The objective is to enable training of generative AI models that capture the statistical distribution of subsurface properties for synthesizing multiple realizations to support uncertainty quantification in seismic inversion.
Significance. If the curation and conversion steps are shown to preserve the relevant statistical properties of real subsurface geology, the open release of this large-scale dataset would meaningfully advance machine-learning applications in geophysics by removing a key barrier of data scarcity. The emphasis on an automated, reproducible pipeline is a concrete strength that supports open science and community reuse.
major comments (1)
- [Automated data curation pipeline] The description of time-to-depth conversion via checkshot interpolation supplies no quantitative validation (mis-tie analysis, comparison to sonic logs or depth-migrated volumes, or checks on preservation of autocorrelation lengths, impedance contrasts, or variograms). This directly affects whether the released volumes can serve as training data whose statistical distribution matches real subsurface properties, which is load-bearing for the central claim that the dataset supports useful generative models for uncertainty quantification.
minor comments (1)
- [Abstract] The abstract would benefit from explicit statements of dataset scale (number of surveys, total inline/crossline counts, or total volume in GB) to substantiate the 'large-scale' descriptor.
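The mis-tie analysis requested in the major comment could take a form like the sketch below. It assumes that horizon picks are available as two-way times at a well, that the same horizons have marker depths from the well, and that the checkshot curve is interpolated linearly. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def depth_at_twt(twt_s, cs_twt_s, cs_depth_m):
    """Linearly interpolate checkshot depth at a two-way time (no extrapolation)."""
    if not cs_twt_s[0] <= twt_s <= cs_twt_s[-1]:
        raise ValueError("two-way time outside the checkshot range")
    for i in range(len(cs_twt_s) - 1):
        t0, t1 = cs_twt_s[i], cs_twt_s[i + 1]
        if t0 <= twt_s <= t1:
            w = (twt_s - t0) / (t1 - t0)
            return cs_depth_m[i] + w * (cs_depth_m[i + 1] - cs_depth_m[i])


def mis_ties(picks_twt_s, marker_depths_m, cs_twt_s, cs_depth_m):
    """Signed mis-tie per horizon: converted depth minus well marker depth."""
    return [
        depth_at_twt(t, cs_twt_s, cs_depth_m) - z
        for t, z in zip(picks_twt_s, marker_depths_m)
    ]


def rms(values):
    """Root-mean-square of a list of numbers."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


# Hypothetical checkshots, horizon picks (s), and well marker depths (m).
cs_twt = [0.0, 1.0, 2.0]
cs_depth = [0.0, 1000.0, 2500.0]
picks_twt = [0.5, 1.5]
marker_depths = [510.0, 1740.0]

ties = mis_ties(picks_twt, marker_depths, cs_twt, cs_depth)
print(ties, rms(ties))
```

Reporting the signed values alongside the RMS helps separate a systematic depth bias from random scatter.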
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential of OpenSeisML to advance machine learning applications in geophysics. We provide a point-by-point response to the major comment below.
-
Referee: Automated data curation pipeline: the description of time-to-depth conversion via checkshot interpolation supplies no quantitative validation (mis-tie analysis, comparison to sonic logs or depth-migrated volumes, or checks on preservation of autocorrelation lengths, impedance contrasts, or variograms). This directly affects whether the released volumes can serve as training data whose statistical distribution matches real subsurface properties, which is load-bearing for the central claim that the dataset supports useful generative models for uncertainty quantification.
Authors: We acknowledge the validity of this observation. The manuscript as submitted emphasizes the design of the automated, reproducible curation pipeline but does not present quantitative validation results for the time-to-depth conversion step. To address this, we will revise the manuscript to include quantitative assessments. Specifically, we will perform and report mis-tie analyses at well locations using the interpolated velocity models, compare converted seismic data with any available depth-domain equivalents where they exist in the public domain, and evaluate preservation of statistical properties including autocorrelation lengths and variogram models on selected volumes. These additions will be supported by figures and tables in a new validation subsection. We believe this will confirm that the converted data retain the essential geological statistics needed for generative modeling.
revision: yes
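One of the statistical checks named in the response, variogram preservation, can be illustrated with an empirical semivariogram along a single regularly sampled trace or log. The sketch below is a generic textbook estimator, not the authors' validation code, and the function name `empirical_semivariogram` and the sample series are hypothetical. In practice the curves would be compared between a reference depth-domain series and the converted volume at the same location.

```python
def empirical_semivariogram(values, max_lag):
    """Return gamma(h) for h = 1..max_lag on a regularly sampled 1D series.

    gamma(h) is the mean of 0.5 * (z[i + h] - z[i]) ** 2 over all valid i.
    """
    if max_lag >= len(values):
        raise ValueError("max_lag must be smaller than the series length")
    gammas = []
    for h in range(1, max_lag + 1):
        sq_diffs = [(values[i + h] - values[i]) ** 2 for i in range(len(values) - h)]
        gammas.append(0.5 * sum(sq_diffs) / len(sq_diffs))
    return gammas


# Hypothetical series: an alternating sequence and a linear ramp.
alternating = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
ramp = [0.0, 1.0, 2.0, 3.0]

print(empirical_semivariogram(alternating, 2))
print(empirical_semivariogram(ramp, 2))
```

The lag at which the curve levels off gives a rough correlation length, so a shift in that lag after conversion would indicate that vertical stretching has altered the spatial statistics.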
Circularity Check
Data-release paper with no derivations, predictions, or self-referential claims
The manuscript is a description of a curated public seismic dataset and an automated curation pipeline for time-to-depth conversion. No equations, fitted parameters, generative-model outputs, or statistical predictions are presented as results derived from first principles. The central contribution is the release of the data volumes themselves; the pipeline steps are procedural rather than deductive. No self-citations are invoked to justify uniqueness or to close a logical loop. The work is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Checkshot data from wells can be interpolated to produce a reliable time-depth relationship for converting post-stack seismic volumes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Paper passage: The objective is to train a generative model that captures the statistical distribution of subsurface properties
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.