arxiv: 2606.27277 · v1 · pith:ZNUTAKF6new · submitted 2026-06-25 · 💻 cs.AI · cs.CV

EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting

Pith reviewed 2026-06-26 04:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords earth observation forecasting · world model · diffusion transformer · NDVI prediction · weather conditioning · probabilistic forecasting · extreme weather benchmark · vegetation response

The pith

EO-WM conditions a video diffusion transformer on separate pathways for climatological baseline, weather anomalies, and accumulated physical stress to produce weather-responsive Earth surface forecasts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames Earth observation forecasting as a partially observed world modeling problem driven by weather, where existing deterministic and diffusion methods fail to isolate how forecasts should change with altered meteorological forcing. EO-WM introduces distinct conditioning pathways for a climatological baseline, weather anomalies, and cumulative stress signals that accumulate over time to represent sustained heat or drought effects. It adds two new benchmarks that measure whether predictions correctly track NDVI decline amplitude and direction under extreme or altered weather rather than only pixel-level accuracy. On these tests the model reduces NDVI decline error by a relative 5.63 percent and raises directional hit rate by a relative 7.80 percent while matching standard metrics.

Core claim

EO-WM is a video diffusion transformer for multispectral EO forecasting that incorporates a physically informed conditioning framework representing meteorological forcing through a climatological baseline, weather anomalies, and cumulative physical stress signals. It separates baseline and anomaly through distinct conditioning pathways and accumulates anomalous forcing over time to capture sustained heat and drought stress. On the introduced Extreme Summer Benchmark and Seasonal Matched-Pair Benchmark this yields a relative 5.63 percent reduction in error for predicted NDVI decline amplitude and a relative 7.80 percent improvement in directional hit rate while remaining competitive on pixel-

What carries the argument

The physically informed conditioning framework that separates meteorological forcing into climatological baseline, weather anomalies, and cumulative physical stress signals through distinct pathways with temporal accumulation.
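
A minimal sketch of how the three forcing signals might be derived from a daily weather series. The Figure 2 caption mentions Clim(tile, month), a precomputed monthly climatology per geographic tile; everything else here is an assumption made for illustration, including the function names, the `threshold` parameter, and the choice of a running sum of positive anomaly excess as the stress accumulator. The review does not specify the paper's actual accumulator.

```python
from collections import defaultdict


def monthly_climatology(history):
    """Mean value per (tile, month) from historical (tile, month, value) records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tile, month, value in history:
        sums[(tile, month)] += value
        counts[(tile, month)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def decompose_forcing(tile, months, values, clim, threshold=0.0):
    """Split a daily series into baseline, anomaly, and cumulative stress.

    ASSUMPTION: stress is the running sum of the anomaly excess above
    `threshold`; this is an illustrative accumulator, not the paper's.
    """
    baseline, anomaly, stress = [], [], []
    running = 0.0
    for month, value in zip(months, values):
        base = clim[(tile, month)]          # climatological baseline
        anom = value - base                 # deviation from the baseline
        running += max(anom - threshold, 0.0)  # only hot excess accumulates
        baseline.append(base)
        anomaly.append(anom)
        stress.append(running)
    return baseline, anomaly, stress


# Hypothetical daily maximum temperatures (deg C) for a single tile.
history = [("tile_a", 7, 28.0), ("tile_a", 7, 30.0), ("tile_a", 8, 29.0)]
clim = monthly_climatology(history)
baseline, anomaly, stress = decompose_forcing(
    "tile_a", months=[7, 7, 7, 8], values=[33.0, 27.0, 31.0, 32.0], clim=clim
)
```

In this toy series the cool second day lowers the anomaly but leaves the stress total unchanged, which is the property that would let a sustained heat signal persist across a brief respite.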

If this is right

  • Forecasts show reduced error in the amplitude of NDVI decline during extreme summer conditions.
  • Directional accuracy improves when testing response to deliberately changed weather forcing.
  • Performance stays competitive on conventional pixel-level reconstruction metrics.
  • Diagnostic benchmarks shift evaluation from reconstruction accuracy toward weather-response fidelity.
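
The two weather-response quantities named above can be illustrated with a short sketch. NDVI itself is the standard ratio (NIR - Red) / (NIR + Red). The definitions of decline amplitude and directional hit rate below are assumptions chosen for illustration, since this review does not give the paper's exact formulas, and all input values are hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def decline_amplitude(series):
    """ASSUMED definition: first NDVI value minus the minimum over the window."""
    return series[0] - min(series)


def mean_amplitude_error(pred_series, true_series):
    """Mean absolute gap between predicted and observed decline amplitudes."""
    gaps = [
        abs(decline_amplitude(p) - decline_amplitude(t))
        for p, t in zip(pred_series, true_series)
    ]
    return sum(gaps) / len(gaps)


def directional_hit_rate(pred_deltas, true_deltas):
    """ASSUMED definition: share of cases where the predicted NDVI change
    has the same sign as the observed change."""
    hits = sum(1 for p, t in zip(pred_deltas, true_deltas) if p * t > 0)
    return hits / len(true_deltas)


# Hypothetical NDVI trajectories for one forecast window.
observed = [[0.75, 0.5, 0.25]]
predicted = [[0.75, 0.5, 0.5]]
amp_error = mean_amplitude_error(predicted, observed)

# Hypothetical NDVI changes under altered weather forcing.
hit_rate = directional_hit_rate([-0.2, 0.1, -0.05, 0.3], [-0.3, -0.1, -0.1, 0.2])
```

A forecast can score well on pixel-level error while still underestimating the drop, as in the toy trajectory here, which is the gap these response-oriented metrics are meant to expose.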

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition of forcing signals could be tested on forecasting other surface properties such as soil moisture or land surface temperature.
  • Explicit separation of baseline, anomaly and stress may reduce the data needed to learn long-term land dynamics compared with undifferentiated conditioning.
  • The new benchmarks could serve as a standard check for any EO forecasting model to verify sensitivity to specific weather perturbations.
  • Extending the accumulation mechanism to multi-year stress signals might allow longer-horizon seasonal forecasts.

Load-bearing premise

Separating meteorological forcing into climatological baseline, weather anomalies, and cumulative physical stress signals through distinct conditioning pathways is sufficient to capture the relevant unobserved land-surface dynamics without additional state variables or more detailed physical process models.

What would settle it

A test that removes or swaps the cumulative stress pathway while keeping baseline and anomaly inputs fixed and measures whether the directional NDVI response under extreme weather collapses to random levels.

Figures

Figures reproduced from arXiv: 2606.27277 by Hengshuang Zhao, Junwei Luo, Shuai Yuan, Yansheng Li, Zhe Liu, Zhenya Yang.

Figure 1: Overview of EO world model and the proposed evaluation benchmarks. (a): EO forecasting differs from standard action-conditioned world modeling: satellite observations are sparse and incomplete, and exogenous weather forcing drives future surface change in ways that depend on unobserved latent Earth-surface states. (b) and (c): We define two EO-specific evaluation dimensions beyond standard reconstruction: …
Figure 2: Overview of EO-WM. Sparse EO observations are encoded into visual latents, while dense daily weather forcing is decomposed using Clim(tile, month), a precomputed monthly climatology for each geographic tile. The climatological features are injected through a shallow conditioning path, whereas anomaly, DEM, visual, and cumulative-stress features are combined into a spatial condition. the packed video-token …
Figure 3: Visual diagnostics apart from benchmark metrics. (a) Predicted versus ground-truth NDVI drop amplitude on the Extreme Summer Benchmark, where the dashed line is perfect severity reproduction. DRA (Drop Reproduction Accuracy) measures relative agreement between predicted and ground-truth drop amplitudes. (b) Extreme-event detection rate by severity bin, measured as the fraction of forecasts whose target-per…
Original abstract

Earth Observation (EO) forecasting aims to predict future Earth surface dynamics from satellite observations under changing meteorological conditions. In this paper, we view this task as a partially observed, weather-driven world modeling problem, in which weather acts as a conditioning signal, while forecasting remains uncertain due to sparse observations and unobserved land-surface states. However, existing methods do not fully capture this setting: deterministic models collapse uncertainty into a single future prediction, while diffusion-based methods typically treat weather variables as undifferentiated conditioning signals, and existing benchmarks focus mainly on reconstruction accuracy rather than whether forecasts respond correctly to changed weather forcing. We introduce EO-WM, a video diffusion transformer for multispectral EO forecasting. EO-WM incorporates a physically informed conditioning framework that represents meteorological forcing through a climatological baseline, weather anomalies, and cumulative physical stress signals. Specifically, it separates baseline and anomaly through distinct conditioning pathways, and accumulates anomalous forcing over time to capture sustained heat and drought stress. To evaluate weather-response behavior beyond standard metrics, we introduce two diagnostic benchmarks: an Extreme Summer Benchmark for severity-aware prediction of vegetation degradation under extreme weather, and a Seasonal Matched-Pair Benchmark for testing response fidelity under changed weather forcing. Experiments show that EO-WM reduces the error in predicted Normalized Difference Vegetation Index (NDVI) decline amplitude by a relative 5.63% and improves directional hit rate by a relative 7.80%, while remaining competitive on standard pixel-level metrics. The benchmarks and model will be made open-source at https://github.com/Luo-Z13/EO-WM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EO-WM, a video diffusion transformer for multispectral Earth Observation forecasting that incorporates a physically informed conditioning framework separating meteorological forcing into climatological baseline, weather anomalies, and cumulative physical stress signals through distinct pathways. It introduces two new diagnostic benchmarks (Extreme Summer Benchmark and Seasonal Matched-Pair Benchmark) to evaluate weather-response behavior beyond standard reconstruction metrics. Experiments report that EO-WM reduces error in predicted NDVI decline amplitude by a relative 5.63% and improves directional hit rate by a relative 7.80% while remaining competitive on pixel-level metrics; the model and benchmarks are to be open-sourced.

Significance. If the central claim holds, the work would be significant for embedding structured physical conditioning into probabilistic EO world models, addressing the gap between undifferentiated weather inputs and observable vegetation response to extremes. The focus on diagnostic benchmarks for directional fidelity and severity-aware prediction, rather than solely pixel-level accuracy, represents a constructive contribution. Open-sourcing the model and benchmarks strengthens reproducibility. However, the significance is limited by the absence of statistical validation and direct tests of the conditioning design's sufficiency.

major comments (2)
  1. [Abstract] Abstract: The reported relative gains (5.63% reduction in NDVI decline amplitude error; 7.80% improvement in directional hit rate) are presented without error bars, dataset sizes, statistical tests, or ablation details. These omissions are load-bearing for the central claim that the three-way conditioning improves weather-response behavior on the new benchmarks.
  2. [Method and Experiments] Method and Experiments sections: No comparison is reported to an otherwise identical architecture augmented with explicit state variables (e.g., recurrent soil-moisture or vegetation-state buffer) or to a version replacing the cumulative stress accumulator with a more detailed process model. This is load-bearing because the gains on the author-introduced benchmarks are attributed to the sufficiency of the baseline/anomaly/stress pathways for capturing unobserved land-surface dynamics.
minor comments (1)
  1. [Experiments] The high-level description of benchmark construction leaves open the possibility that metric choices influence the reported improvements; more explicit details on how the Extreme Summer and Seasonal Matched-Pair benchmarks are constructed would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional statistical detail and justification of design choices would strengthen the manuscript. We respond to each major comment below.

  1. Referee: [Abstract] Abstract: The reported relative gains (5.63% reduction in NDVI decline amplitude error; 7.80% improvement in directional hit rate) are presented without error bars, dataset sizes, statistical tests, or ablation details. These omissions are load-bearing for the central claim that the three-way conditioning improves weather-response behavior on the new benchmarks.

    Authors: We agree that the abstract and results would benefit from explicit statistical support. In the revised manuscript we will report mean performance with standard deviations across multiple random seeds, the exact sizes of the Extreme Summer and Seasonal Matched-Pair benchmarks, and the outcomes of paired statistical tests (e.g., Wilcoxon signed-rank) for the reported relative improvements. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: No comparison is reported to an otherwise identical architecture augmented with explicit state variables (e.g., recurrent soil-moisture or vegetation-state buffer) or to a version replacing the cumulative stress accumulator with a more detailed process model. This is load-bearing because the gains on the author-introduced benchmarks are attributed to the sufficiency of the baseline/anomaly/stress pathways for capturing unobserved land-surface dynamics.

    Authors: The EO setting is defined by partially observed inputs; explicit recurrent state buffers would require auxiliary variables (soil moisture, vegetation state) that are not part of the multispectral observation stream used in our benchmarks. We will add a dedicated paragraph in the revised Method section clarifying this modeling choice and will include additional pathway-specific ablations that isolate the contribution of the cumulative-stress accumulator to the directional and amplitude metrics. revision: partial
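
The statistics promised in the first response (mean and standard deviation across seeds, plus a paired Wilcoxon signed-rank test) could look roughly like the sketch below. It is a pure-Python illustration with an exact p-value by enumeration, which is only practical for small samples; in practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice. The error values are hypothetical, and pairing by random seed is an assumption, since the rebuttal does not say what the paired units are.

```python
from itertools import product
from statistics import mean, stdev


def signed_ranks(diffs):
    """Return (W_plus, W_minus, ranks); zero differences are dropped and
    tied absolute differences receive their average rank."""
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return w_plus, w_minus, ranks


def exact_two_sided_p(diffs):
    """Exact two-sided p-value from all 2^n sign assignments (small n only)."""
    w_plus, w_minus, ranks = signed_ranks(diffs)
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= observed:
            extreme += 1
    return extreme / 2 ** len(ranks)


# Hypothetical NDVI-decline errors for six paired runs (baseline vs. model).
baseline_err = [2.0, 1.75, 2.25, 1.5, 2.5, 2.0]
model_err = [1.5, 1.5, 1.5, 1.625, 1.5, 1.625]
diffs = [b - m for b, m in zip(baseline_err, model_err)]

summary = (mean(model_err), stdev(model_err))
w_plus, w_minus, _ = signed_ranks(diffs)
p_value = exact_two_sided_p(diffs)
```

With only six pairs, the smallest achievable two-sided p-value is 2/64, so a handful of seeds cannot by itself support a strong significance claim; larger paired samples (for example, per benchmark item) would give the test more resolution.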

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a conditioning framework and evaluates empirical performance gains on newly proposed benchmarks using independent metrics (NDVI decline amplitude error, directional hit rate). These outcomes are measured results on held-out or constructed test cases rather than quantities defined in terms of the fitted parameters or model architecture. No self-citations, self-definitional equations, or fitted-input-as-prediction patterns appear in the abstract or described claims. The derivation remains self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review limits visibility; the approach rests on the domain assumption that weather can be usefully decomposed into baseline, anomaly, and cumulative stress signals for conditioning a generative model.

axioms (1)
  • domain assumption: Weather acts as a conditioning signal while forecasting remains uncertain due to sparse observations and unobserved land-surface states.
    Stated in the opening paragraph of the abstract as the problem framing.
invented entities (1)
  • cumulative physical stress signals (no independent evidence)
    purpose: Capture sustained heat and drought stress by accumulating anomalous forcing over time.
    Introduced as part of the conditioning framework; no independent falsifiable prediction outside the model is mentioned.

pith-pipeline@v0.9.1-grok · 5825 in / 1347 out tokens · 37695 ms · 2026-06-26T04:29:30.925695+00:00 · methodology

Benchmark construction (appendix fragment)

  • Within each track, we rank pairs by their respective divergence score (descending).
  • Per seasonal phase (3 phases: offsets 0, 20, 40), we select the top 50 highest-divergence pairs, yielding 150 pairs per track.
  • A per-cube cap of 3 pairs prevents any single location from dominating the benchmark.

The three track selections are merged via union and deduplicated by pair identity. Each pair retains metadata indicating which track(s) selected it and its rank within each track. The final benchmark contains 422 unique pairs (844 inference windows). Multi-track membership provides complementary evaluation perspectives: 394 pairs belong to a single paper-fa...
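
The selection steps described above (rank by divergence, take the top pairs per seasonal phase, cap pairs per cube, then merge tracks by union) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the dictionary keys, track names, and function names are hypothetical, the cap is assumed to be counted per track across all phases, and the recorded rank is assumed to be the position within the phase ranking for that track.

```python
from collections import defaultdict


def select_track(pairs, track, per_phase=50, cube_cap=3):
    """Greedy top-k selection per seasonal phase for one track.

    `pairs` is a list of dicts with hypothetical keys 'id', 'cube',
    'phase', and 'scores' (track name -> divergence score).
    ASSUMPTION: the per-cube cap is counted per track across phases.
    """
    by_phase = defaultdict(list)
    for pair in pairs:
        by_phase[pair["phase"]].append(pair)

    per_cube = defaultdict(int)
    selected = []  # (pair id, rank in this track's phase ranking)
    for phase in sorted(by_phase):
        ranked = sorted(
            by_phase[phase], key=lambda p: p["scores"][track], reverse=True
        )
        taken = 0
        for rank, pair in enumerate(ranked, start=1):
            if taken == per_phase:
                break
            if per_cube[pair["cube"]] >= cube_cap:
                continue  # this location already has its quota
            per_cube[pair["cube"]] += 1
            selected.append((pair["id"], rank))
            taken += 1
    return selected


def merge_tracks(pairs, tracks, **limits):
    """Union of per-track selections, deduplicated by pair id.

    Each kept pair maps to {track: rank} for every track that chose it.
    """
    merged = {}
    for track in tracks:
        for pair_id, rank in select_track(pairs, track, **limits):
            merged.setdefault(pair_id, {})[track] = rank
    return merged


# Toy candidate pool with two hypothetical tracks and small limits.
pairs = [
    {"id": "a", "cube": "c1", "phase": 0, "scores": {"t1": 0.9, "t2": 0.1}},
    {"id": "b", "cube": "c1", "phase": 0, "scores": {"t1": 0.8, "t2": 0.7}},
    {"id": "c", "cube": "c2", "phase": 0, "scores": {"t1": 0.2, "t2": 0.9}},
    {"id": "d", "cube": "c1", "phase": 20, "scores": {"t1": 0.6, "t2": 0.3}},
    {"id": "e", "cube": "c2", "phase": 20, "scores": {"t1": 0.5, "t2": 0.8}},
]
merged = merge_tracks(pairs, ["t1", "t2"], per_phase=2, cube_cap=2)
```

In the toy pool, pair "d" is skipped by track t1 because cube c1 has already reached its cap, while pairs "b" and "e" are chosen by both tracks and appear once in the merged result with both ranks recorded.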