arxiv: 2604.16841 · v1 · submitted 2026-04-18 · 💻 cs.CV · cs.LG

Recognition: unknown

When Earth Foundation Models Meet Diffusion: An Application to Land Surface Temperature Super-Resolution

Lingyao Li, Peishi Jiang, Qikai Hu, Rita Sousa, Runlong Yu, Xinyue Ye, Yiheng Chen, Yilong Dai, Zihui Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:59 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords land surface temperaturesuper-resolutiondiffusion modelsearth foundation modelsremote sensingcross-attentiongeospatial embeddings

0 comments

The pith

Pretrained Earth foundation models can guide diffusion models to reconstruct fine-scale land surface temperatures from 32-times degraded satellite observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes using embeddings from a pretrained Earth foundation model to condition a diffusion model for super-resolving land surface temperature data. The core idea is that these embeddings, derived from high-resolution multispectral images, provide the missing fine-scale geospatial context needed to generate realistic high-resolution thermal maps from very coarse inputs. A large-scale evaluation on over 240,000 global patches demonstrates consistent improvements over baseline approaches, with cross-attention proving more effective than direct feature concatenation. If correct, this means remote sensing applications could achieve higher resolution thermal imagery without requiring new sensors or data collection. The method also offers two variants that trade off between visual realism and measurement accuracy.

Core claim

The paper claims that the EFDiff framework, which encodes high-resolution multispectral reflectance using the Prithvi-EO-2.0 model and injects the resulting embeddings into a diffusion denoising network via cross-attention, enables effective super-resolution of land surface temperature under a 32x scale gap. This conditioning guides the generative process to recover fine-scale structures that are otherwise underdetermined by the degraded thermal observations. Two variants, EFDiff-ε and EFDiff-x0, provide complementary strengths in perceptual quality and pixel fidelity. Experiments on a benchmark of 242,416 co-registered patches show that EFDiff outperforms baselines and that the cross-attent

What carries the argument

Cross-attention injection of geospatial embeddings from the Prithvi-EO-2.0 Earth foundation model into a diffusion model's denoising network to condition generation of high-resolution land surface temperature from degraded inputs.

If this is right

Super-resolution performance improves consistently across a diverse global dataset when using foundation model guidance.
Cross-attention conditioning extracts more useful information from the embeddings than simple concatenation of input channels.
The approach generalizes beyond LST to other remote sensing reconstruction tasks where pretrained representations are available.
Variants allow selection based on whether perceptual realism or exact pixel values are prioritized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pretrained models trained on multispectral data may implicitly learn temperature-related spatial patterns that transfer to thermal super-resolution.
This conditioning strategy could reduce the need for paired high-low resolution training data in future remote sensing applications.
Extending the framework to other foundation models or modalities might reveal which pretraining objectives best support generative downscaling.
Testing on real-world operational data streams would show practical utility for environmental monitoring.

Load-bearing premise

Embeddings from the Prithvi-EO-2.0 model encode enough fine-scale geospatial structure to accurately guide land surface temperature reconstruction even when the input thermal data is degraded by a factor of 32.

What would settle it

If EFDiff does not outperform baselines on a new set of co-registered Landsat patches at 32x degradation or if channel concatenation performs better than cross-attention, the advantage of the proposed conditioning would be called into question.

Figures

Figures reproduced from arXiv: 2604.16841 by Lingyao Li, Peishi Jiang, Qikai Hu, Rita Sousa, Runlong Yu, Xinyue Ye, Yiheng Chen, Yilong Dai, Zihui Ma.

**Figure 4.1.** Figure 4.1: Overall framework of EFDiff. it decouples the EFM’s parameter count from the training cost, since no optimizer state is maintained for the encoder. Third, the resulting embeddings provide a stable, globally informed spatial context: because the encoder has seen millions of diverse land surfaces during pretraining, it captures vegetation boundaries, urban fabric, water bodies, and terrain texture that are… view at source ↗

**Figure 5.1.** Figure 5.1: Global distribution of training and test tiles. [PITH_FULL_IMAGE:figures/full_fig_p005_5_1.png] view at source ↗

**Figure 5.2.** Figure 5.2: Per-pixel RMSE maps aggregated over the test set. [PITH_FULL_IMAGE:figures/full_fig_p006_5_2.png] view at source ↗

**Figure 5.3.** Figure 5.3: Qualitative comparison on a representative test patch. From top to bottom: reconstructed field [PITH_FULL_IMAGE:figures/full_fig_p007_5_3.png] view at source ↗

**Figure 5.** Figure 5: presents a qualita [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 5.4.** Figure 5.4: Per-patch ∆RMSE versus spatial complexity for the two diffusion formulations. Grey dots denote individual patches; colored circles denote binned means with standard error bars. HLS RGB GT EFDiff- EFDiff-x0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 E r r o r [° C] [PITH_FULL_IMAGE:figures/full_fig_p008_5_4.png] view at source ↗

**Figure 5.5.** Figure 5.5: Error difference maps for two selected test [PITH_FULL_IMAGE:figures/full_fig_p008_5_5.png] view at source ↗

read the original abstract

Land surface temperature (LST) super-resolution is important for environmental monitoring. However, it remains challenging as coarse thermal observations severely underdetermine fine-scale structure. In this paper, we propose Earth Foundation Model-guided Diffusion (EFDiff), a novel framework for super-resolution under extreme spatial degradation. EFDiff uses the Prithvi-EO-2.0 Earth foundation model to encode high-resolution multispectral reflectance into geospatial embeddings, which are injected into the denoising network via cross-attention to guide fine-scale reconstruction from highly degraded observations. We study two variants, EFDiff-$\epsilon$ and EFDiff-$x_0$, which offer complementary trade-offs between perceptual realism and pixel-level fidelity. We evaluate EFDiff under an extreme $32\times$ scale gap using a globally diverse benchmark comprising 242,416 co-registered Landsat thermal-reflectance patches. Results show that EFDiff consistently outperforms baseline methods and that cross-attention conditioning by EFM is more effective than HLS channel concatenation. Although we present EFDiff in the context of LST super-resolution, the framework is broadly applicable to remote sensing problems in which pretrained geospatial representations can guide generative reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EFDiff conditions a diffusion model on Prithvi-EO-2.0 embeddings via cross-attention to handle 32x LST super-resolution and beats simple concatenation on a large patch set, but the embeddings' ability to supply missing fine-scale structure is not directly checked.

read the letter

The paper's core move is to take a pretrained Earth foundation model, pull embeddings from high-res multispectral reflectance, and feed them into a diffusion denoiser through cross-attention so the model can hallucinate plausible fine thermal detail from 32x coarser inputs. That specific conditioning choice on this scale looks new enough to be worth noting, and the two variants (epsilon and x0) give a clean way to trade perceptual quality against pixel accuracy. They also built a solid evaluation set of over 240k co-registered Landsat patches that span different regions, which is more than most remote-sensing super-resolution papers bother with. The claim that cross-attention works better than just stacking the HLS channels as extra input channels is the main empirical result, and the abstract presents it as consistent across the benchmark. That part is straightforward and easy to understand. The soft spot is exactly the one the stress-test flags: nothing in the write-up shows that the Prithvi embeddings actually carry the high-frequency spatial patterns needed when the thermal input is almost blank. Foundation models trained at 10-30 m scales often collapse to coarser semantic or spectral features, so it is possible the reported gains come more from the extra capacity of cross-attention than from genuinely informative conditioning. A frequency analysis or resolution ablation on the embeddings would have settled this quickly, but it is missing. The paper is aimed at people who already work on remote-sensing data enhancement and want a generative route that leverages existing foundation models rather than training everything from scratch. It is not a broad methodological advance, but the application is practical and the dataset size gives it some weight. I would send it to peer review; the central idea is clear, the benchmark is usable, and the main open question is narrow enough that referees can address it directly.

Referee Report

2 major / 1 minor

Summary. The paper proposes Earth Foundation Model-guided Diffusion (EFDiff), a diffusion framework for LST super-resolution under 32× degradation. It extracts geospatial embeddings from the pretrained Prithvi-EO-2.0 model applied to high-resolution multispectral reflectance and injects them into the denoising network via cross-attention. Two variants (EFDiff-ε and EFDiff-x0) are presented for perceptual vs. fidelity trade-offs, evaluated on a benchmark of 242,416 global Landsat patches, with claims of consistent outperformance over baselines and superiority of cross-attention conditioning over HLS concatenation.

Significance. If the results hold, the work would demonstrate a practical way to leverage large-scale pretrained Earth foundation models for guiding generative reconstruction in remote sensing, addressing the severe under-determination of fine-scale thermal structure from coarse observations. The scale of the benchmark and the framework's stated generality to other modalities are notable strengths that could influence downstream applications in environmental monitoring.

major comments (2)

Abstract: the claim that 'EFDiff consistently outperforms baseline methods' and that 'cross-attention conditioning by EFM is more effective than HLS channel concatenation' is asserted without any quantitative metrics, tables, or error statistics, preventing assessment of effect size or statistical significance.
Abstract / Results: the central claim that cross-attention from Prithvi-EO-2.0 embeddings enables accurate 32× reconstruction rests on the untested assumption that these embeddings preserve high-frequency spatial structure. Under 32× degradation the thermal input supplies essentially no spatial cues, yet no ablation, frequency analysis, or resolution-specific experiment is described to show that the ViT embeddings transfer fine-scale geospatial detail rather than coarse semantics.

minor comments (1)

Abstract: the symbols ε and x0 in the variant names EFDiff-ε and EFDiff-x0 are not defined; a one-sentence clarification of their diffusion-process meaning would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important opportunities to strengthen the presentation of results and the mechanistic justification for our approach. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: Abstract: the claim that 'EFDiff consistently outperforms baseline methods' and that 'cross-attention conditioning by EFM is more effective than HLS channel concatenation' is asserted without any quantitative metrics, tables, or error statistics, preventing assessment of effect size or statistical significance.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we will update the abstract to report key aggregate metrics from the 242,416-patch global benchmark (e.g., mean PSNR, SSIM, and LPIPS improvements of EFDiff over the strongest baselines, together with the corresponding gains of cross-attention over HLS concatenation). These numbers will be drawn directly from the results tables already present in the paper and will be accompanied by a brief statement of statistical significance where appropriate. revision: yes
Referee: Abstract / Results: the central claim that cross-attention from Prithvi-EO-2.0 embeddings enables accurate 32× reconstruction rests on the untested assumption that these embeddings preserve high-frequency spatial structure. Under 32× degradation the thermal input supplies essentially no spatial cues, yet no ablation, frequency analysis, or resolution-specific experiment is described to show that the ViT embeddings transfer fine-scale geospatial detail rather than coarse semantics.

Authors: This is a fair and substantive critique of the current evidence for the conditioning mechanism. While Prithvi-EO-2.0 is pretrained on high-resolution multispectral imagery, the manuscript does not yet contain explicit ablations or frequency-domain diagnostics. In the revision we will add (i) an ablation that replaces the high-resolution-derived embeddings with embeddings computed from spatially degraded inputs, (ii) power-spectrum and wavelet-based frequency analyses comparing reconstructions with and without the cross-attention pathway, and (iii) a resolution-sweep experiment at intermediate scale factors (4×, 8×, 16×) to isolate the contribution of fine-scale geospatial detail. These additions will directly test whether the embeddings supply high-frequency structure beyond coarse semantics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; relies on external pretrained model

full rationale

The paper proposes EFDiff by encoding high-resolution multispectral reflectance with the external pretrained Prithvi-EO-2.0 model and injecting the resulting embeddings into a diffusion denoiser via cross-attention. This conditioning step and the subsequent empirical evaluation on 242,416 co-registered Landsat patches against baselines do not reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The framework is self-contained against external benchmarks with no equations or claims that equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text would be required to identify any.

pith-pipeline@v0.9.0 · 5539 in / 1042 out tokens · 53071 ms · 2026-05-10T06:59:44.569089+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 5 canonical work pages · 1 internal anchor

[1]

[1]N. Agam, W. P. Kustas, M. C. Anderson, F. Li, and C. M. Neale,A vegetation index based technique for spatial sharpening of thermal im- agery, Remote sensing of Environment, 107 (2007), pp. 545–558. [2]Y. Blau and T. Michaeli,The perception- distortion tradeoff, in Proceedings of the IEEE con- ferenceoncomputervisionandpatternrecognition, 2018, pp. 6228...

work page arXiv 2007
[2]

[5]J. Chen, L. Jia, J. Zhang, Y. Feng, X. Zhao, and R. Tao,Super-resolution for land surface temperature retrieval images via cross-scale diffu- sion model using reference images, Remote Sens- ing, 16 (2024), p

2024
[3]

[6]X. Chen, X. W ang, J. Zhou, Y. Qiao, and C. Dong,Activating more pixels in image super-resolution transformer, in Proceedings of the IEEE/CVFconferenceoncomputervisionandpat- tern recognition, 2023, pp. 22367–22377. [7]M. Cla verie, J. Ju, J. G. Masek, J. L. Dun- gan, E. F. Vermote, J.-C. Roger, S. V. Skakun, and C. Justice,The harmonized land- sat and...

2023
[4]

Open-source AI model and interface for Earth

[8]Clay Foundation,Clay foundation model, 2024, https://github.com/Clay-foundation/model. Open-source AI model and interface for Earth. [9]Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. Lobell, and S. Ermon,Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery, Ad- vances in Neural Information Processing Sy...

2024
[5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[13]A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al.,An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, (2020). [14]Earth Resources Obser v ation and Sci- ence (EROS) Center,Landsat 8-9 opera- tional land imager...

work page internal anchor Pith review Pith/arXiv arXiv 2010
[6]

[21]G. C. Hulley, F. M. Göttsche, G. Rivera, S. J. Hook, R. J. Freepartner, M. A. Mar- tin, K. Ca wse-Nicholson, and W. R. John- son,Validation and quality assessment of the ecostress level-2 land surface temperature and emis- sivity product, IEEE Transactions on Geoscience and Remote Sensing, 60 (2021), pp. 1–23. [22]C. Hutengs and M. Vohland,Downscaling...

2021
[7]

Lacoste, N

[24]A. Lacoste, N. Lehmann, P. Rodriguez, E. Sher win, H. Kerner, B. Lütjens, J. Ir vin, D. Dao, H. Alemohammad, A. Drouin, et al., Geo-bench: Toward foundation models for earth monitoring, Advances in Neural Information Pro- cessing Systems, 36 (2023), pp. 51080–51093. [25]C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, ...

work page doi:10.1016/j.rsase 2023
[8]

W ang, Z

[40]J. W ang, Z. Fu, B. Tang, and J. Xu, Information-guided diffusion model for downscal- ing land surface temperature from sdgsat-1 re- mote sensing images, Remote Sensing, 17 (2025), p

2025
[9]

Neural plasticity- inspired multimodal foundation model for earth observation.arXiv preprint arXiv:2403.15356, 2024

[41]Z. W ang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli,Image quality assess- ment: from error visibility to structural similarity, IEEE transactions on image processing, 13 (2004), pp. 600–612. [42]Z. Xiong, Y. W ang, F. Zhang, A. J. Stew- art, J. Hanna, D. Borth, I. Papoutsis, B. Le Saux, G. Camps-V alls, and X. X. Zhu, Neural plasticity-inspired ...

work page arXiv 2004
[10]

[43]N. Xu, Z. Yin, M. Gao, J. Li, L. Song, and P. Wu,Downscaling cldas land surface tempera- ture using modis data and a multi-attention multi- residual super-resolution network, IEEE Transac- tions on Geoscience and Remote Sensing, (2025). [44]Z. Yue, J. W ang, and C. C. Loy,Resshift: Efficient diffusion model for image super-resolution by residual shift...

work page arXiv 2025
[11]

Tiles from Antarctica and central Green- land are removed due to limited relevance and sparse valid thermal coverage

and the RESOLVE 2017 ecoregion dataset [12]. Tiles from Antarctica and central Green- land are removed due to limited relevance and sparse valid thermal coverage. The remaining 3,094 MGRS tiles are then split geographically into 2,943 training tiles and 151 test tiles. For each selected tile, we re- trieve up to 48 observations from 2014 to 2026, sub- jec...

2017
[12]

Because extreme32×super-resolution is highly ill- posed, we report both classes of metrics to capture the perception–distortion trade-off [2]

and FID [19], which assess whether generated thermal fields exhibit realistic local texture and match the overall distribution of real sam- ples. Because extreme32×super-resolution is highly ill- posed, we report both classes of metrics to capture the perception–distortion trade-off [2]. To further expose performance differences across terrain types, all ...

2026