pith. machine review for the scientific record.

arxiv: 2605.10949 · v1 · submitted 2026-04-29 · 📊 stat.AP · cs.AI · cs.CV · cs.LG

AlphaEarth Satellite Embeddings for Modelling Climate Sensitive Diseases Towards Global Health Resilience

I-Han Cheng, Sara Khalid, Usman Nazir

Pith reviewed 2026-05-13 07:02 UTC · model grok-4.3

classification 📊 stat.AP · cs.AI · cs.CV · cs.LG
keywords satellite embeddings · malaria · respiratory infection · stunting · global health · climate variability · predictive modeling · machine learning

The pith

64-dimensional satellite embeddings improve predictions of malaria and respiratory infections in vulnerable populations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether 64-dimensional satellite embeddings are useful predictors for three major climate-sensitive child health outcomes: malaria, acute respiratory infection, and stunting. The three studies cover Nigeria (malaria), 11 countries (acute respiratory infection), and 35 countries (stunting). The embeddings add value to standard models for the first two outcomes, raising R-squared per region for malaria and pooled across countries for respiratory infection, while showing no benefit for stunting at the country level, where they are collinear with country fixed effects. The approach addresses sparse health surveillance data in low- and middle-income countries by supplying scalable environmental information from satellites. If the gains hold up, such embeddings could support more proactive public health responses to climate variability affecting disease spread and nutrition.

Core claim

Across three studies, the AlphaEarth Foundations 64-dimensional satellite embeddings supply predictive value at sufficient spatial granularity for modelling malaria, childhood acute respiratory infection, and child stunting. Malaria models in Nigeria gain consistent per-region R-squared improvements. Respiratory infection models across eleven countries see pooled R-squared rise from 0.157 to 0.206, an improvement that is consistent across three tree-based estimators. Stunting models across thirty-five countries show no change at the country level because a country-level embedding is collinear with the country fixed effects.
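
For reference, the quantities being compared can be written out. This is the standard definition of out-of-sample R-squared, not a formula quoted from the paper; the symbols (y_i for observed outcomes, ŷ_i for predictions, ȳ for the mean observed outcome in the test set) are chosen here for illustration.

```latex
R^2 = 1 - \frac{\sum_{i \in \mathrm{test}} (y_i - \hat{y}_i)^2}
               {\sum_{i \in \mathrm{test}} (y_i - \bar{y})^2},
\qquad
\Delta R^2 = R^2_{\mathrm{baseline\,+\,embeddings}} - R^2_{\mathrm{baseline}}
```

With the pooled respiratory infection figures reported above, ΔR² = 0.206 − 0.157 = 0.049.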

What carries the argument

The 64-dimensional satellite embeddings that represent Earth's surface characteristics for use as input features in statistical health models.

If this is right

  • Consistent gains in malaria prediction accuracy at the regional level in Nigeria.
  • Increased explanatory power for acute respiratory infection models when pooling data from multiple countries.
  • Need for finer spatial resolution data to evaluate embedding contributions to stunting predictions beyond country-level controls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the embeddings encode unique climate signals, they could enable real-time monitoring of health risks in areas without ground sensors.
  • Combining embeddings with other data sources might improve forecasts for additional climate-related conditions, such as other vector-borne diseases.
  • Testing at higher spatial resolutions could resolve the collinearity issue observed in stunting analyses.

Load-bearing premise

The satellite embeddings provide environmental and climate information that is independent of traditional covariates and country fixed effects.

What would settle it

Demonstrating no improvement in prediction performance when embeddings are added to baseline models that already include standard environmental covariates and fixed effects.
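
The test described here can be sketched on synthetic data. The following is a minimal illustration, not the paper's pipeline: ordinary least squares stands in for the tree-based estimators, and all data, dimensions, coefficients, and function names (`fit_predict_ols`, `delta_r2`) are assumptions made for the example. It contrasts a case where the embeddings carry outcome signal with a null case where they carry none.

```python
import numpy as np

def fit_predict_ols(X_train, y_train, X_test):
    """Least squares with an intercept; a stand-in for the paper's tree models."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

def r2(y_true, y_pred):
    """Out-of-sample R^2 relative to the mean of the held-out outcomes."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def delta_r2(X, E, y, n_train):
    """Held-out R^2 for baseline covariates X, for X plus embeddings E, and their difference."""
    XE = np.hstack([X, E])
    y_tr, y_te = y[:n_train], y[n_train:]
    base = r2(y_te, fit_predict_ols(X[:n_train], y_tr, X[n_train:]))
    aug = r2(y_te, fit_predict_ols(XE[:n_train], y_tr, XE[n_train:]))
    return base, aug, aug - base

rng = np.random.default_rng(0)
n, n_train = 600, 400                      # hypothetical sample sizes
X = rng.normal(size=(n, 3))                # synthetic "traditional" covariates
E = rng.normal(size=(n, 64))               # synthetic 64-dimensional embeddings
noise = rng.normal(size=n)
beta = np.array([1.0, -1.0, 0.5])

y_signal = X @ beta + E[:, :4].sum(axis=1) + noise   # embeddings carry signal
y_null = X @ beta + noise                            # embeddings are irrelevant

_, _, d_signal = delta_r2(X, E, y_signal, n_train)
_, _, d_null = delta_r2(X, E, y_null, n_train)
print(f"delta R2 with signal: {d_signal:.3f}; delta R2 under the null: {d_null:.3f}")
```

In the null case the held-out gain is near zero or slightly negative, because the 64 extra columns only add estimation noise. That is the pattern the settling experiment would look for on the real data.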

Figures

Figures reproduced from arXiv: 2605.10949 by I-Han Cheng, Sara Khalid, Usman Nazir.

Figure 1. Case 1: Malaria case prediction in Nigeria (NMEP, 2000–2024; train 2000–2023, test 2024). AlphaEarth embeddings provide a geographically uniform R² gain (left) and emerge as the dominant feature group in the importance decomposition (right), supporting the interpretation that static landscape structure carries malaria-transmission signal not captured by monthly climate covariates.
Figure 2. Case 2: Childhood ARI prediction across 11 LMICs (DHS, 2017–2022; 5-fold × 2-repeat CV). The pooled R² improves from 0.157 (gaseous pollutants + controls) to 0.206 (gas + controls + AlphaEarth). The improvement is consistent across three tree-based estimators (Random Forest, HistGradientBoosting, XGBoost), indicating that the signal lies in the embeddings rather than in any one model's inductive bias.
Figure 3. Case 3: Stunting (WHZ) prediction across 35 DHS countries, 2015+ (country-level AE prototype). A country-broadcast AlphaEarth fingerprint is constant within each country and therefore collinear with the model's country fixed effect / embedding; as predicted, it yields no meaningful gain (ΔR² ≈ 0). This negative result motivates the cluster-level extraction (described in the text) as the experiment that c…
Figure 4. Case 1: Malaria case prediction in Nigeria (NMEP, 2000–2024; train 2000–2023, test 2024). AlphaEarth embeddings provide a geographically uniform R² gain (left) and emerge as the dominant feature group in the importance decomposition (right), supporting the interpretation that static landscape structure carries malaria-transmission signal not captured by monthly climate covariates.
Figure 5. Pooled-model ablation. R² (left) and RMSE (right) for each estimator–feature-set combination. Error bars denote cross-fold standard deviation across 10 folds. All three models agree that adding embeddings to the gaseous baseline yields a consistent improvement, with XGBoost reaching the highest R² = 0.210 and Random Forest the lowest RMSE.
Original abstract

Malaria, childhood acute respiratory infection, and child undernutrition together account for over two million deaths annually in children under five, with the burden concentrated in low and middle-income countries where climate variability modulates transmission, exposure, and nutritional outcomes. Routine health surveillance in these settings remains sparse and reactive. Satellite-derived representations of the Earth's surface offer a scalable, low-cost complement to traditional covariates, yet their utility as predictors of population health outcomes is poorly characterised. We summarise findings from three studies evaluating AlphaEarth Foundations 64-dimensional satellite embeddings as predictors of population health outcomes, focusing on vulnerable populations. The studies span infectious disease (malaria, respiratory infection) and stunting. In each study, embeddings provide predictive value at sufficient spatial granularity: (i) malaria prediction across Nigeria shows consistent per-region R^2 gains; (ii) childhood acute respiratory infection prediction across 11 DHS countries increases pooled R^2 from 0.157 to 0.206 across three tree-based estimators; (iii) stunting prediction across 35 countries is neutral at country level due to collinearity with fixed effects. The stunting case is currently limited by lack of DHS cluster-level coordinates, which is the next key experiment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript evaluates 64-dimensional AlphaEarth satellite embeddings as predictors for three climate-sensitive health outcomes in low- and middle-income countries: malaria across Nigeria, childhood acute respiratory infection (ARI) across 11 DHS countries, and child stunting across 35 countries. It reports that embeddings yield per-region R² gains for malaria, increase pooled R² from 0.157 to 0.206 for ARI across tree-based estimators, and produce neutral results for stunting at country level due to collinearity with fixed effects. The central claim is that these embeddings supply useful environmental and climate signal at sufficient spatial granularity when added to standard models.

Significance. If the embeddings demonstrably capture independent environmental information, the work could support scalable, low-cost augmentation of sparse health surveillance data for climate-sensitive diseases. The concrete R² lifts in the ARI and malaria cases indicate potential practical utility for predictive modeling in data-poor settings, though the stunting neutrality highlights limits when fixed effects are present.

major comments (3)
  1. [Abstract] Abstract: The reported R² values (e.g., ARI pooled increase 0.157→0.206; Nigeria per-region gains) are presented without any model specifications, cross-validation scheme, significance testing, or treatment of spatial autocorrelation, making it impossible to assess whether the gains are robust or artifactual.
  2. [Malaria and ARI studies] Malaria and ARI studies: The central claim that embeddings add predictive value requires that the 64-dimensional representations supply information orthogonal to traditional covariates. No multicollinearity diagnostics, variance-inflation factors, or ablation experiments (e.g., orthogonalizing embeddings before refitting) are described, so the observed R² improvements could simply reflect increased model flexibility rather than new signal.
  3. [Stunting study] Stunting study: Collinearity with country fixed effects is invoked to explain the neutral result, yet no quantitative support (correlation matrix, VIF scores, or condition indices) is supplied; this leaves the interpretation post-hoc and weakens the contrast drawn with the other two studies.
minor comments (1)
  1. [Abstract] The limitation regarding missing DHS cluster-level coordinates for stunting is noted but not accompanied by a concrete proposal for the next experiment (e.g., required sample size or coordinate precision).
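
The diagnostics requested in major comment 2 can be sketched as follows. This is an illustrative outline on synthetic data, assuming the usual definition VIF_j = 1 / (1 − R_j²) and residualization by least squares. The helper names (`add_intercept`, `vif`, `residualize`), the 8-dimensional stand-in for the embeddings, and the data are all hypothetical and do not come from the manuscript.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones."""
    return np.column_stack([np.ones(len(X)), X])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns plus an intercept."""
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = add_intercept(np.delete(X, j, axis=1))
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2_j = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

def residualize(E, X):
    """Remove from each embedding column the part linearly explained by X (and an intercept)."""
    A = add_intercept(X)
    coef, *_ = np.linalg.lstsq(A, E, rcond=None)
    return E - A @ coef

rng = np.random.default_rng(3)
n = 500                                         # hypothetical sample size
X = rng.normal(size=(n, 3))                     # synthetic traditional covariates
W = rng.normal(size=(3, 8))                     # hypothetical mixing weights
E = X @ W + 0.5 * rng.normal(size=(n, 8))       # synthetic embeddings partly explained by X

vifs = vif(np.hstack([X, E]))                   # one VIF per column of the joint design
E_perp = residualize(E, X)                      # embeddings orthogonalized against X
print("largest VIF:", round(float(vifs.max()), 1))
```

One caveat on interpretation: for a linear model, `[X, E]` and `[X, E_perp]` span the same space, so refitting on the residualized embeddings leaves the fit unchanged (the Frisch-Waugh-Lovell result). The ablation is therefore mainly informative for nonlinear learners such as the tree-based estimators discussed here.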

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important gaps in methodological transparency and supporting diagnostics that we will address through targeted revisions. Below we respond point-by-point to the major comments.

  1. Referee: [Abstract] Abstract: The reported R² values (e.g., ARI pooled increase 0.157→0.206; Nigeria per-region gains) are presented without any model specifications, cross-validation scheme, significance testing, or treatment of spatial autocorrelation, making it impossible to assess whether the gains are robust or artifactual.

    Authors: We agree that the abstract requires additional context to allow readers to evaluate the robustness of the reported improvements. In the revised manuscript we will expand the abstract to briefly specify the tree-based estimators, the cross-validation procedure (including spatial blocking where applied), and note that significance of R² gains was assessed via permutation tests. Full methodological details, including explicit treatment of spatial autocorrelation through clustered cross-validation, will remain in the Methods section. These additions will be kept concise to respect abstract length limits. revision: yes

  2. Referee: [Malaria and ARI studies] Malaria and ARI studies: The central claim that embeddings add predictive value requires that the 64-dimensional representations supply information orthogonal to traditional covariates. No multicollinearity diagnostics, variance-inflation factors, or ablation experiments (e.g., orthogonalizing embeddings before refitting) are described, so the observed R² improvements could simply reflect increased model flexibility rather than new signal.

    Authors: We accept that demonstrating orthogonality is essential to substantiate the central claim. The revised manuscript will include variance inflation factor (VIF) diagnostics for the full covariate set (traditional variables plus embeddings) in both the malaria and ARI studies. We will also add ablation experiments in which the embeddings are orthogonalized against the traditional covariates via Gram-Schmidt or residualization before refitting; any remaining R² gains will be reported to isolate the contribution of new environmental signal. These analyses will be presented in the Results and Methods sections. revision: yes

  3. Referee: [Stunting study] Stunting study: Collinearity with country fixed effects is invoked to explain the neutral result, yet no quantitative support (correlation matrix, VIF scores, or condition indices) is supplied; this leaves the interpretation post-hoc and weakens the contrast drawn with the other two studies.

    Authors: We agree that the collinearity explanation requires quantitative backing to be convincing. In the revision we will supply a correlation matrix between the 64-dimensional embeddings and the country fixed-effect indicators, together with VIF scores and condition indices computed for the stunting models. These metrics will be reported in a new supplementary table and referenced in the main text, allowing readers to directly compare the degree of collinearity across the three studies and strengthening the interpretation of the neutral stunting results. revision: yes
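
The collinearity argument for the stunting case can be checked numerically. The sketch below uses synthetic data and ordinary least squares as a stand-in for the paper's models; the 35 countries and 64 dimensions mirror the study, while the rows per country, the covariates, and the outcome are invented for illustration. A feature that is constant within each country is a linear combination of the country dummies, so appending it changes neither the rank of the design matrix nor the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_countries, n_per, dim = 35, 20, 64        # n_per (rows per country) is hypothetical
country = np.repeat(np.arange(n_countries), n_per)

D = np.eye(n_countries)[country]            # one-hot country fixed effects
fingerprints = rng.normal(size=(n_countries, dim))  # one synthetic embedding per country
E = fingerprints[country]                   # broadcast: constant within each country
x = rng.normal(size=(len(country), 2))      # synthetic individual-level covariates
country_effect = rng.normal(size=n_countries)[country]
y = x @ np.array([0.5, -0.3]) + country_effect + rng.normal(size=len(country))

def fitted(A, y):
    """Least-squares fitted values (minimum-norm solution if A is rank deficient)."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

base = np.hstack([D, x])                    # fixed effects + covariates
aug = np.hstack([D, x, E])                  # same, plus country-broadcast embeddings

rank_base = np.linalg.matrix_rank(base)
rank_aug = np.linalg.matrix_rank(aug)
max_diff = np.max(np.abs(fitted(base, y) - fitted(aug, y)))
print(rank_base, rank_aug, max_diff)
```

The equality is exact only for linear models. For tree-based learners the analogous point is that a country-constant feature can only separate observations along country boundaries, which a country identifier already does.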

Circularity Check

0 steps flagged

No circularity: empirical R^2 gains from external health data

The manuscript presents three empirical case studies that fit standard tree-based estimators (e.g., random forests, gradient boosting) to external DHS and malaria surveillance records, then compare out-of-sample R^2 with versus without the 64-dimensional AlphaEarth embeddings as additional covariates. No equations, normalizations, or self-citations are shown that would reduce the reported R^2 lifts (Nigeria per-region gains; pooled ARI lift 0.157→0.206; stunting neutrality due to collinearity) to quantities defined by the same fitted parameters or by prior author work. The stunting analysis explicitly flags the collinearity issue rather than concealing it, and the embeddings themselves are treated as fixed external inputs. This is a conventional predictive-validation design whose central numbers are not forced by construction.
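
The with-versus-without comparison described above can be sketched under a repeated K-fold scheme. The figure captions mention 5-fold × 2-repeat cross-validation, giving 10 fold-level scores; the paper's splitting code is not shown, so the splitter below is an assumption. The data are synthetic, the 16-dimensional embedding block is a hypothetical stand-in, and least squares again replaces the tree-based estimators.

```python
import numpy as np

def ols_r2(A_train, y_train, A_test, y_test):
    """Fit least squares with an intercept on the training fold; return held-out R^2."""
    A_train = np.column_stack([np.ones(len(A_train)), A_train])
    A_test = np.column_stack([np.ones(len(A_test)), A_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    return 1.0 - resid @ resid / np.sum((y_test - y_test.mean()) ** 2)

def repeated_kfold(n, n_splits, n_repeats, rng):
    """Yield (train_idx, test_idx); each repeat reshuffles and partitions all n rows."""
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n), n_splits)
        for k in range(n_splits):
            train = np.concatenate([f for i, f in enumerate(folds) if i != k])
            yield train, folds[k]

rng = np.random.default_rng(2)
n = 500                                        # hypothetical sample size
X = rng.normal(size=(n, 3))                    # synthetic baseline covariates
E = rng.normal(size=(n, 16))                   # synthetic embedding block
y = X @ np.array([1.0, -1.0, 0.5]) + E[:, :2].sum(axis=1) + rng.normal(size=n)
XE = np.hstack([X, E])

deltas = []
for train, test in repeated_kfold(n, n_splits=5, n_repeats=2, rng=rng):
    base = ols_r2(X[train], y[train], X[test], y[test])
    aug = ols_r2(XE[train], y[train], XE[test], y[test])
    deltas.append(aug - base)                  # paired difference on the same fold
deltas = np.array(deltas)
print(f"mean delta R2 = {deltas.mean():.3f}, std across folds = {deltas.std(ddof=1):.3f}")
```

Scoring both feature sets on identical folds keeps the comparison paired, so the fold-to-fold spread of the differences gives a rough sense of how stable the gain is. Random folds of this kind do not account for spatial autocorrelation, which is the concern raised in the referee report.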

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of supervised machine learning applied to geospatial health data and the premise that pre-trained satellite embeddings encode relevant climate and environmental signals.

axioms (2)
  • domain assumption: Tree-based regression models produce unbiased estimates of predictive performance when trained on DHS survey data with standard cross-validation.
    Invoked implicitly for the ARI pooled R^2 comparison across three estimators.
  • domain assumption: Satellite embeddings are fixed external features whose information content is independent of the health outcome labels.
    Required for interpreting R^2 gains as added predictive value rather than leakage.
