ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets
Pith reviewed 2026-06-27 01:32 UTC · model grok-4.3
The pith
ThousandWorlds supplies roughly 1800 simulations from five climate models to train emulators that map eight planet parameters to 3D atmospheric fields.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves.
What carries the argument
The ThousandWorlds dataset, which assembles runs from five GCMs into three nested regression tasks that map eight planet parameters onto 3D atmospheric fields.
If this is right
- Emulators can be ranked by how closely they reproduce fields within one model and how well they stay inside the spread across models.
- Gaussian-process methods appear better suited than current deep networks for this parameter-to-field mapping under limited data.
- The benchmark supplies a concrete testbed for handling structured missingness across multiple simulators.
- Progress on the hardest task level directly reduces the cost of exploring habitable-zone climates before running new full simulations.
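One way to read the first bullet is as a ratio: the emulator's error against one model, divided by the typical disagreement between models for the same planet. The following is a minimal sketch of that idea, assuming fields are NumPy arrays on a shared grid. The function names `rmse`, `inter_model_spread`, and `relative_error` are hypothetical, and the paper's actual protocol may be defined differently.

```python
import itertools
import numpy as np

def rmse(a, b):
    """Root-mean-square difference between two fields on the same grid."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def inter_model_spread(fields):
    """Mean pairwise RMSE between fields produced by different GCMs."""
    pairs = itertools.combinations(fields, 2)
    return float(np.mean([rmse(a, b) for a, b in pairs]))

def relative_error(prediction, target, all_gcm_fields):
    """Emulator RMSE divided by inter-model spread (illustrative metric only)."""
    return rmse(prediction, target) / inter_model_spread(all_gcm_fields)

# Toy data: three stand-in "GCMs" that differ by constant offsets.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8, 16))           # (level, lat, lon)
gcm_fields = [base, base + 1.0, base + 2.0]
prediction = base + 0.5                      # emulator output for the first GCM

spread = inter_model_spread(gcm_fields)
score = relative_error(prediction, gcm_fields[0], gcm_fields)
```

Under this reading, a score below 1 would mean the emulator sits closer to its target than the GCMs sit to one another.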
Where Pith is reading between the lines
- The same dataset structure could be reused to test whether emulators trained on synthetic climates improve interpretation of real atmospheric spectra once those spectra become available.
- Adding runs from additional climate models would tighten the inter-model disagreement baseline and expose whether current GP superiority holds under greater simulator diversity.
- The missingness task level offers a natural probe for methods that must infer fields when only partial vertical or horizontal data are supplied, a situation likely to arise with sparse observations.
Load-bearing premise
The five chosen GCMs and the sampled planet-parameter space produce a representative enough ensemble that performance on the benchmark will translate to useful emulation on real exoplanet observations.
What would settle it
An emulator achieving high scores on ThousandWorlds yet producing temperature or cloud fields that systematically diverge from independent GCM runs or from actual telescope spectra of known exoplanets would show the benchmark does not capture the needed generalization.
Original abstract
The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule may signal life on one planet and abiotic chemistry on another. Global climate models (GCMs) provide this understanding, but individual runs can require up to millions of core-hours and substantial domain expert time. Machine-learning emulators could remove this bottleneck, but progress has been limited by the absence of a curated, multi-model exoclimate dataset. We introduce ThousandWorlds, an ML-ready benchmark for exoclimate emulation and for the broader regime of low-data, multi-simulator, parameter-to-field regression. The dataset contains approximately 1800 simulations from five GCMs, mapping eight planet parameters to 3D atmospheric fields including temperature, humidity, winds, clouds, and radiation. Three nested subsets define progressively harder challenges: single-simulator regression, multi-simulator regression with complete observations, and multi-simulator regression with structured missingness. We propose two evaluation protocols: one for ranking methods, and one that measures performance relative to the disagreement between GCMs themselves. We evaluate seven baselines spanning simple methods, deep learning, and Gaussian processes. GP-based methods perform best, suggesting that ThousandWorlds exposes a regime where off-the-shelf deep learning does not yet succeed. Data: https://doi.org/10.57967/hf/8695. Code: https://github.com/edstevenson/ThousandWorlds.
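To make "parameter-to-field regression" concrete, here is a small sketch of one common emulation recipe: compress the fields with PCA, then fit a Gaussian process from the planet parameters to the PCA scores. It is not the paper's GP baseline. The class name `PCAGPEmulator`, the kernel settings, and the synthetic data are all assumptions chosen for illustration, and nothing here uses the ThousandWorlds data or code.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

class PCAGPEmulator:
    """Toy emulator: PCA on flattened fields, GP regression on PCA scores."""

    def __init__(self, n_components=4, lengthscale=1.0, noise=1e-6):
        self.n_components = n_components
        self.lengthscale = lengthscale
        self.noise = noise

    def fit(self, X, Y):
        # X: (n_sims, n_params); Y: (n_sims, ...) fields on any fixed grid.
        self.field_shape = Y.shape[1:]
        Yf = Y.reshape(len(Y), -1)
        self.mean = Yf.mean(axis=0)
        _, _, Vt = np.linalg.svd(Yf - self.mean, full_matrices=False)
        self.basis = Vt[: self.n_components]           # (k, n_grid)
        scores = (Yf - self.mean) @ self.basis.T       # (n_sims, k)
        K = rbf_kernel(X, X, self.lengthscale)
        K += self.noise * np.eye(len(X))               # jitter for stability
        self.alpha = np.linalg.solve(K, scores)        # GP weights per component
        self.X_train = X
        return self

    def predict(self, X_new):
        K_star = rbf_kernel(X_new, self.X_train, self.lengthscale)
        scores = K_star @ self.alpha                   # GP posterior mean
        fields = scores @ self.basis + self.mean
        return fields.reshape((len(X_new),) + self.field_shape)

# Synthetic stand-in data: 8 parameters -> a (3, 4, 5) field.
rng = np.random.default_rng(0)
patterns = rng.normal(size=(2, 3, 4, 5))               # two fixed spatial patterns

def toy_fields(X):
    w = np.stack([np.sin(X[:, 0]), X[:, 1] ** 2], axis=1)   # (n, 2)
    return np.tensordot(w, patterns, axes=1)                 # (n, 3, 4, 5)

X_train = rng.uniform(size=(40, 8))
emulator = PCAGPEmulator(n_components=2).fit(X_train, toy_fields(X_train))
pred = emulator.predict(rng.uniform(size=(5, 8)))
```

The PCA step keeps the GP output dimension small, which matters when only a few hundred simulations are available per model; the fixed lengthscale here would normally be tuned rather than hard-coded.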
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ThousandWorlds, an ML-ready benchmark dataset for exoclimate emulation consisting of approximately 1800 simulations from five GCMs that map eight planet parameters to 3D atmospheric fields (temperature, humidity, winds, clouds, radiation). It defines three nested task subsets (single-simulator regression, multi-simulator with complete observations, multi-simulator with structured missingness) and two evaluation protocols (one for method ranking and one measuring performance relative to inter-GCM disagreement). It reports baseline results for seven methods (simple, deep learning, and GP), among which GP-based approaches perform best.
Significance. If the released resource matches the description, this addresses a documented gap by providing a public, multi-GCM, parameter-to-field dataset with DOI and code repository. The nested tasks, structured missingness, and evaluation against GCM disagreement supply a concrete, falsifiable benchmark for low-data multi-simulator regression that extends beyond single-model emulation. Credit is due for the data release and the explicit comparison to inter-model spread rather than to self-derived quantities.
Minor comments (2)
- [abstract / §3] The abstract states 'approximately 1800 simulations' without breaking down the count per GCM or per task subset; adding a table or explicit counts in §3 would improve reproducibility of the baseline splits.
- [§4] The 'structured missingness' pattern in the third task is referenced but not illustrated with an example mask or pseudocode; a small figure or listing in the methods would clarify the protocol for downstream users.
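As an illustration of the kind of listing the second comment asks for, the sketch below builds a missingness mask and scores predictions only where fields are observed. The actual missingness pattern is not given in this summary, so the sketch assumes, purely for illustration, that each simulator omits whole variables. The simulator names, the `AVAILABLE` table, and the helpers `build_mask` and `masked_rmse` are all hypothetical.

```python
import numpy as np

VARIABLES = ["temperature", "humidity", "winds", "clouds", "radiation"]

# Hypothetical availability table: which variables each simulator reports.
AVAILABLE = {
    "gcm_a": {"temperature", "humidity", "winds", "clouds", "radiation"},
    "gcm_b": {"temperature", "winds", "radiation"},
    "gcm_c": {"temperature", "humidity"},
}

def build_mask(simulators):
    """Boolean mask of shape (n_sims, n_vars); True where a field is observed."""
    return np.array([[v in AVAILABLE[s] for v in VARIABLES] for s in simulators])

def masked_rmse(pred, target, mask):
    """RMSE over observed entries only; pred/target are (n_sims, n_vars, n_grid)."""
    sq_err = (pred - target) ** 2
    observed = np.broadcast_to(mask[:, :, None], sq_err.shape)
    return float(np.sqrt(sq_err[observed].mean()))

simulators = ["gcm_a", "gcm_b", "gcm_c", "gcm_b"]
mask = build_mask(simulators)

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 5, 10))
target[~mask] = np.nan            # missing fields carry no values
pred = target + 1.0               # toy prediction, off by exactly 1 where observed

score = masked_rmse(pred, target, mask)
```

Because the mask is determined by the simulator rather than drawn at random per sample, the missingness is "structured" in the sense that an emulator cannot treat gaps as independent noise.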
Simulated Author's Rebuttal
We thank the referee for their careful reading and positive assessment. The report accurately summarizes the contribution and recommends acceptance with no major comments.
Circularity Check
No significant circularity detected
This is a data-release and benchmarking paper whose central claim is the public release of ~1800 multi-GCM simulations together with three nested task definitions and two evaluation protocols. No derivations, equations, fitted parameters, or self-citations are invoked as load-bearing steps in any claimed prediction or uniqueness result. The evaluation protocols compare emulators directly to inter-GCM disagreement rather than to quantities derived from the same fitted objects, rendering the work self-contained against external benchmarks.