
arxiv: 2606.18078 · v1 · pith:5XUZ7NKTnew · submitted 2026-06-16 · 📊 stat.ME

Spatial prediction of environmental processes using random forests: How best to account for spatial dependence?

Pith reviewed 2026-06-26 23:26 UTC · model grok-4.3

classification 📊 stat.ME
keywords spatial prediction · random forests · spatial dependence · geostatistics · air pollution · environmental processes · Malawi

The pith

Spatial basis functions consistently perform well when incorporating spatial dependence into random forest predictions for environmental processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines multiple ways to add spatial autocorrelation handling to random forests, which are otherwise strong for non-spatial prediction. It runs a simulation study across different autocorrelation patterns and applies the methods to air pollution prediction in a Malawi tuberculosis study. Results indicate no approach wins in every scenario, yet spatial basis functions deliver steady predictive gains in both the controlled simulations and the real data case. This comparison helps environmental scientists select practical methods without defaulting to traditional Kriging or unmodified machine learning.

Core claim

No single approach to accounting for spatial dependence in random forests is universally superior across different types of spatial autocorrelation, but utilising spatial basis functions appears to perform consistently well across both the simulation and real data studies.

What carries the argument

Spatial basis functions added to random forests to capture spatial autocorrelation, compared against Gaussian process fusion, observation-driven correlations, and local geographical fitting.
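As a rough illustration of the basis-function idea, the sketch below appends Gaussian radial basis functions of the coordinates to an existing covariate matrix, so that any random forest implementation can split on them. The abstract does not say which basis functions the paper uses; the Gaussian form, the regular knot grid, and the names `grid_knots`, `gaussian_basis`, `augment_with_basis` and `bandwidth` are all assumptions made here for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def grid_knots(xmin: float, xmax: float, ymin: float, ymax: float,
               n_per_side: int) -> List[Point]:
    """Place n_per_side x n_per_side knots on a regular grid (n_per_side >= 2)."""
    step = n_per_side - 1
    xs = [xmin + (xmax - xmin) * i / step for i in range(n_per_side)]
    ys = [ymin + (ymax - ymin) * j / step for j in range(n_per_side)]
    return [(x, y) for x in xs for y in ys]


def gaussian_basis(site: Point, knots: Sequence[Point],
                   bandwidth: float) -> List[float]:
    """Evaluate one Gaussian radial basis function per knot at a single site."""
    sx, sy = site
    return [
        math.exp(-((sx - kx) ** 2 + (sy - ky) ** 2) / (2.0 * bandwidth ** 2))
        for kx, ky in knots
    ]


def augment_with_basis(covariates: Sequence[Sequence[float]],
                       sites: Sequence[Point],
                       knots: Sequence[Point],
                       bandwidth: float) -> List[List[float]]:
    """Append the basis-function values to each covariate row.

    The returned rows are ordinary feature vectors; fitting the forest
    itself is left to whichever random forest library is in use.
    """
    return [
        list(row) + gaussian_basis(site, knots, bandwidth)
        for row, site in zip(covariates, sites)
    ]
```

The same knots and bandwidth would need to be reused when building features at prediction locations, so that training and prediction rows share one feature layout.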

Load-bearing premise

The simulation experiment's chosen spatial autocorrelation types and the single real-data case in Blantyre are representative enough to identify a consistently superior approach for general environmental processes.

What would settle it

A new simulation study or real-world dataset where spatial basis functions fail to rank among the top performers while another method succeeds across multiple autocorrelation structures would challenge the consistency claim.
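One way to make this test concrete is to rank the methods within each scenario and ask whether a given method stays within the top k everywhere. The sketch below assumes a lower error is better and that results are held as a nested dictionary (scenario, then method, then error); the function names and that data layout are hypothetical and not taken from the paper.

```python
from typing import Dict

Errors = Dict[str, Dict[str, float]]  # scenario -> method -> error (lower is better)


def ranks_by_scenario(errors: Errors) -> Dict[str, Dict[str, int]]:
    """Rank methods within each scenario, with 1 for the smallest error.

    Ties are broken by the order in which methods appear in the input.
    """
    ranked = {}
    for scenario, by_method in errors.items():
        ordered = sorted(by_method, key=by_method.get)
        ranked[scenario] = {method: i + 1 for i, method in enumerate(ordered)}
    return ranked


def consistently_top(errors: Errors, method: str, k: int) -> bool:
    """Return True if `method` ranks within the top k in every scenario."""
    ranks = ranks_by_scenario(errors)
    return all(per_scenario[method] <= k for per_scenario in ranks.values())
```

Under this reading, the consistency claim would be challenged by a set of scenarios where `consistently_top` is false for spatial basis functions but true for some other method at the same k.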

Original abstract

Geostatistical spatial prediction for environmental processes is typically undertaken using Gaussian process models via Kriging, while machine learning (ML) algorithms are state-of-the-art for non-spatial prediction. An exciting recent fusion of these ideas imbues traditional ML algorithms with the capacity to deal with spatial autocorrelation, leading to improved predictive performance. A range of approaches have been proposed, including fusion with Gaussian processes, observation-driven correlation structures, spatial basis functions and local geographical fitting. However, there has been no numerical comparison of their relative predictive performances, which is needed to advise environmental scientists on the optimal approach to use. This paper fills this knowledge gap, and focuses on random forests as the ML algorithm because they are more computationally and conceptually straightforward to implement than deep learning algorithms. The results from two studies are presented, the first being a controlled simulation experiment investigating whether any single approach is consistently superior across different spatial autocorrelation types. The second study focuses on the prediction of air pollution concentrations within a tuberculosis prevalence study in Blantyre, Malawi. The results show that whilst no single approach is universally superior, utilising spatial basis functions appears to perform consistently well across both the simulation and real data studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript compares methods for incorporating spatial dependence into random forest models for spatial prediction of environmental processes. It reports results from a controlled simulation experiment across different spatial autocorrelation types and a real-data application predicting air pollution in Blantyre, Malawi. The central claim is that no single approach is universally superior, but spatial basis functions perform consistently well across both studies.

Significance. If the findings hold, the work supplies practical empirical guidance for environmental scientists choosing between Kriging, non-spatial random forests, and spatial extensions such as basis functions. The dual design of simulation plus real data is a strength that allows both controlled comparisons and practical relevance.

major comments (1)
  1. [Simulation study and real-data study] The simulation experiment's chosen spatial autocorrelation types plus the single Blantyre case are not demonstrated to cover non-stationary fields, anisotropic structures, or multiple independent real-world regimes; this is load-bearing for the claim that spatial basis functions 'perform consistently well' and for the advice on 'how best to account for spatial dependence' in arbitrary environmental processes.
minor comments (1)
  1. [Abstract] The abstract lists approaches ('fusion with Gaussian processes, observation-driven correlation structures, spatial basis functions and local geographical fitting') but does not name the exact implementations compared; adding the specific variants would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on the scope of our studies. We address the major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Simulation study and real-data study] The simulation experiment's chosen spatial autocorrelation types plus the single Blantyre case are not demonstrated to cover non-stationary fields, anisotropic structures, or multiple independent real-world regimes; this is load-bearing for the claim that spatial basis functions 'perform consistently well' and for the advice on 'how best to account for spatial dependence' in arbitrary environmental processes.

    Authors: We agree that the simulation was restricted to common stationary and isotropic autocorrelation structures (e.g., exponential and Matérn covariances with varying ranges) and that the real-data application comprises only the single Blantyre air-pollution case. These choices do not encompass non-stationary or anisotropic fields, nor do they represent multiple independent real-world regimes. Consequently, the manuscript's phrasing that spatial basis functions 'perform consistently well' and the associated practical advice should be qualified. In the revision we will (i) explicitly state the stationarity and isotropy assumptions in the simulation design, (ii) add a limitations subsection in the Discussion that cautions against extrapolation beyond the examined scenarios, and (iii) replace the general claim with the more precise statement that, within the stationary isotropic settings and single real-data example considered, spatial basis functions showed robust performance. We will also note the need for future work on non-stationary and anisotropic cases.

    revision: yes
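For readers unfamiliar with the covariance families the simulated rebuttal names, the sketch below writes out an exponential covariance and a Matérn covariance with smoothness fixed at 3/2, both as functions of distance only (which is what makes them stationary and isotropic). The parameterisation by a variance `sigma2` and range `phi`, the choice of smoothness 3/2, and the function names are assumptions; the paper's actual simulation settings are not given on this page.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def exponential_cov(h: float, sigma2: float = 1.0, phi: float = 1.0) -> float:
    """Exponential covariance: sigma2 * exp(-h / phi)."""
    return sigma2 * math.exp(-h / phi)


def matern32_cov(h: float, sigma2: float = 1.0, phi: float = 1.0) -> float:
    """Matern covariance with smoothness 3/2 (closed form, no Bessel function)."""
    a = math.sqrt(3.0) * h / phi
    return sigma2 * (1.0 + a) * math.exp(-a)


def cov_matrix(sites: Sequence[Point], cov: Callable[..., float],
               **params: float) -> List[List[float]]:
    """Covariance matrix over sites; entries depend only on Euclidean distance."""
    return [[cov(math.dist(s, t), **params) for t in sites] for s in sites]
```

An anisotropic or non-stationary field, the case the referee flags as untested, would require the covariance to depend on direction or location as well as distance, which neither function above allows.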

Circularity Check

0 steps flagged

No circularity: empirical comparison of methods via simulation and real data

Full rationale

The paper conducts a simulation study across chosen autocorrelation types and evaluates methods on one real air-pollution dataset in Blantyre. Its central claim (spatial basis functions perform consistently well) is an empirical observation from those experiments, not a derived quantity obtained by fitting parameters to the target metric or by reducing via self-citation to an unverified premise. No equations define a prediction that equals its own inputs by construction, and no load-bearing uniqueness theorem or ansatz is imported from prior author work. The study is self-contained against external benchmarks (simulated fields and observed concentrations) and reports performance metrics directly.
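The paper's Table 1 caption names bias, root mean square prediction error (RMSPE) and coverage as the reported metrics. A minimal sketch of how such out-of-sample metrics are commonly computed follows; the sign convention for bias (predicted minus observed) and the closed-interval definition of coverage are assumptions, since the paper's exact definitions are not shown here.

```python
import math
from typing import Sequence


def bias(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of (predicted - observed); the sign convention is an assumption."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


def rmspe(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square prediction error over held-out observations."""
    squared = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def coverage(y_true: Sequence[float], lower: Sequence[float],
             upper: Sequence[float]) -> float:
    """Fraction of observations falling inside their prediction interval."""
    hits = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return hits / len(y_true)
```

For nominal 95% prediction intervals, a coverage value near 0.95 on held-out data would indicate well-calibrated uncertainty.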

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the comparison relies on standard random-forest and spatial-statistical assumptions not detailed here.

pith-pipeline@v0.9.1-grok · 5752 in / 889 out tokens · 22802 ms · 2026-06-26T23:26:33.534634+00:00 · methodology


Table 1: Comparison of the out-of-sample predictive abilities of a non-spatial random forest and a number of spatially adapted alternatives. The table presents bias, root mean square prediction error (RMSPE), coverage ...