EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting

Hengshuang Zhao; Junwei Luo; Shuai Yuan; Yansheng Li; Zhe Liu; Zhenya Yang

arxiv: 2606.27277 · v1 · pith:ZNUTAKF6new · submitted 2026-06-25 · 💻 cs.AI · cs.CV


Junwei Luo, Shuai Yuan, Zhenya Yang, Yansheng Li, Zhe Liu, Hengshuang Zhao

Pith reviewed 2026-06-26 04:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords earth observation forecasting · world model · diffusion transformer · NDVI prediction · weather conditioning · probabilistic forecasting · extreme weather benchmark · vegetation response


The pith

EO-WM conditions a video diffusion transformer on separate pathways for climatological baseline, weather anomalies, and accumulated physical stress to produce weather-responsive Earth surface forecasts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames Earth observation forecasting as a partially observed world modeling problem driven by weather, where existing deterministic and diffusion methods fail to isolate how forecasts should change with altered meteorological forcing. EO-WM introduces distinct conditioning pathways for a climatological baseline, weather anomalies, and cumulative stress signals that accumulate over time to represent sustained heat or drought effects. It adds two new benchmarks that measure whether predictions correctly track NDVI decline amplitude and direction under extreme or altered weather rather than only pixel-level accuracy. On these tests the model reduces NDVI decline error by a relative 5.63 percent and raises directional hit rate by a relative 7.80 percent while matching standard metrics.

Core claim

EO-WM is a video diffusion transformer for multispectral EO forecasting that incorporates a physically informed conditioning framework representing meteorological forcing through a climatological baseline, weather anomalies, and cumulative physical stress signals. It separates baseline and anomaly through distinct conditioning pathways and accumulates anomalous forcing over time to capture sustained heat and drought stress. On the introduced Extreme Summer Benchmark and Seasonal Matched-Pair Benchmark this yields a relative 5.63 percent reduction in error for predicted NDVI decline amplitude and a relative 7.80 percent improvement in directional hit rate while remaining competitive on pixel-level metrics.

What carries the argument

The physically informed conditioning framework that separates meteorological forcing into climatological baseline, weather anomalies, and cumulative physical stress signals through distinct pathways with temporal accumulation.
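The three-way split can be illustrated with a small sketch. It is not the authors' implementation: `Clim(tile, month)` is named in the Figure 2 caption, but the function names, the positive-part accumulation rule, and the `decay` parameter below are assumptions chosen only to make the idea concrete for a single weather variable at a single tile.

```python
from collections import defaultdict


def monthly_climatology(history):
    """Mean value per calendar month for one tile.

    history: iterable of (month, value) pairs from past years.
    Plays the role of Clim(tile, month); the averaging rule is an assumption.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for month, value in history:
        sums[month] += value
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sums}


def decompose_forcing(days, clim, decay=1.0):
    """Split daily forcing into baseline, anomaly, and cumulative stress.

    days: list of (month, value) daily observations in the forecast window.
    Stress accumulates only positive anomalies (for example excess heat);
    decay < 1 lets older days fade. Both choices are illustrative assumptions.
    """
    baseline, anomaly, stress = [], [], []
    running = 0.0
    for month, value in days:
        b = clim[month]
        a = value - b
        running = decay * running + max(a, 0.0)
        baseline.append(b)
        anomaly.append(a)
        stress.append(running)
    return baseline, anomaly, stress


# Hypothetical July temperatures in degrees Celsius.
clim = monthly_climatology([(7, 28.0), (7, 30.0), (8, 27.0)])
baseline, anomaly, stress = decompose_forcing([(7, 31.0), (7, 28.0), (7, 33.0)], clim)
print(baseline, anomaly, stress)
```

The point of keeping three separate series is that a model can receive the slowly varying baseline, the day-to-day departure from it, and the memory of sustained departures as different inputs rather than as one undifferentiated weather vector.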

If this is right

Forecasts show reduced error in the amplitude of NDVI decline during extreme summer conditions.
Directional accuracy improves when testing response to deliberately changed weather forcing.
Performance stays competitive on conventional pixel-level reconstruction metrics.
Diagnostic benchmarks shift evaluation from reconstruction accuracy toward weather-response fidelity.
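For readers who want the two headline quantities in concrete form, here is a minimal sketch. The paper's exact metric definitions are not given on this page, so everything below is an assumption: drop amplitude is taken as the first NDVI value minus the window minimum, the amplitude error is a mean absolute difference, and a directional "hit" means the predicted and observed NDVI changes share a sign. `relative_change` shows how a "relative" percentage is formed from a baseline score and a new score.

```python
def drop_amplitude(ndvi_series):
    """NDVI decline within a window: first value minus the minimum (assumed definition)."""
    return ndvi_series[0] - min(ndvi_series)


def mean_drop_error(pred_series, true_series):
    """Mean absolute difference between predicted and observed drop amplitudes."""
    errors = [
        abs(drop_amplitude(p) - drop_amplitude(t))
        for p, t in zip(pred_series, true_series)
    ]
    return sum(errors) / len(errors)


def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def directional_hit_rate(pred_deltas, true_deltas):
    """Fraction of cases where predicted and observed NDVI changes share a sign."""
    hits = sum(1 for p, t in zip(pred_deltas, true_deltas) if sign(p) == sign(t))
    return hits / len(true_deltas)


def relative_change(baseline, new):
    """Change of `new` relative to `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Hypothetical values, not taken from the paper.
print(mean_drop_error([[0.8, 0.6, 0.7]], [[0.8, 0.5, 0.6]]))
print(directional_hit_rate([0.1, -0.2, 0.05, -0.1], [0.2, -0.1, -0.3, -0.4]))
print(relative_change(0.5, 0.75))
```

Under these definitions a model can score well on pixel-level error while still getting the sign or size of the vegetation response wrong, which is the gap the diagnostic benchmarks are meant to expose.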

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decomposition of forcing signals could be tested on forecasting other surface properties such as soil moisture or land surface temperature.
Explicit separation of baseline, anomaly and stress may reduce the data needed to learn long-term land dynamics compared with undifferentiated conditioning.
The new benchmarks could serve as a standard check for any EO forecasting model to verify sensitivity to specific weather perturbations.
Extending the accumulation mechanism to multi-year stress signals might allow longer-horizon seasonal forecasts.

Load-bearing premise

Separating meteorological forcing into climatological baseline, weather anomalies, and cumulative physical stress signals through distinct conditioning pathways is sufficient to capture the relevant unobserved land-surface dynamics without additional state variables or more detailed physical process models.

What would settle it

A test that removes or swaps the cumulative stress pathway while keeping baseline and anomaly inputs fixed and measures whether the directional NDVI response under extreme weather collapses to random levels.
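A swap test of this kind can be sketched as a small harness. The model interface, the sample layout, and `toy_model` are hypothetical stand-ins, not part of EO-WM: the harness shuffles the stress input across samples while leaving baseline and anomaly untouched, then compares directional hit rates before and after.

```python
import random


def directional_hit_rate(model, samples):
    """Fraction of samples where the predicted NDVI change has the observed sign."""
    hits = 0
    for s in samples:
        predicted = model(s["baseline"], s["anomaly"], s["stress"])
        if (predicted < 0) == (s["observed_change"] < 0):
            hits += 1
    return hits / len(samples)


def swap_stress(samples, seed=0):
    """Return copies of the samples with the stress input shuffled across them.

    Baseline and anomaly stay attached to their original sample.
    """
    rng = random.Random(seed)
    shuffled = [s["stress"] for s in samples]
    rng.shuffle(shuffled)
    return [dict(s, stress=v) for s, v in zip(samples, shuffled)]


def toy_model(baseline, anomaly, stress):
    """Hypothetical model whose predicted NDVI change depends only on stress."""
    return -0.1 if stress > 5.0 else 0.05


# Synthetic samples in which the observed response is driven by stress alone.
samples = [
    {
        "baseline": 25.0,
        "anomaly": 1.0,
        "stress": float(k),
        "observed_change": -0.1 if k > 5 else 0.05,
    }
    for k in range(12)
]

intact = directional_hit_rate(toy_model, samples)
ablated = directional_hit_rate(toy_model, swap_stress(samples))
print(f"intact={intact:.2f} ablated={ablated:.2f}")
```

If a real model's hit rate barely moves under the swap, the stress pathway is not carrying the directional signal; a large drop would support the paper's attribution.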

Figures

Figures reproduced from arXiv: 2606.27277 by Hengshuang Zhao, Junwei Luo, Shuai Yuan, Yansheng Li, Zhe Liu, Zhenya Yang.

**Figure 1.** Overview of EO world model and the proposed evaluation benchmarks. (a): EO forecasting differs from standard action-conditioned world modeling: satellite observations are sparse and incomplete, and exogenous weather forcing drives future surface change in ways that depend on unobserved latent Earth-surface states. (b) and (c): We define two EO-specific evaluation dimensions beyond standard reconstruction: …

**Figure 2.** Overview of EO-WM. Sparse EO observations are encoded into visual latents, while dense daily weather forcing is decomposed using Clim(tile, month), a precomputed monthly climatology for each geographic tile. The climatological features are injected through a shallow conditioning path, whereas anomaly, DEM, visual, and cumulative-stress features are combined into a spatial condition. The packed video-token …

**Figure 3.** Visual diagnostics apart from benchmark metrics. (a) Predicted versus ground-truth NDVI drop amplitude on the Extreme Summer Benchmark, where the dashed line is perfect severity reproduction. DRA (Drop Reproduction Accuracy) measures relative agreement between predicted and ground-truth drop amplitudes. (b) Extreme-event detection rate by severity bin, measured as the fraction of forecasts whose target-per…

Original abstract

Earth Observation (EO) forecasting aims to predict future Earth surface dynamics from satellite observations under changing meteorological conditions. In this paper, we view this task as a partially observed, weather-driven world modeling problem, in which weather acts as a conditioning signal, while forecasting remains uncertain due to sparse observations and unobserved land-surface states. However, existing methods do not fully capture this setting: deterministic models collapse uncertainty into a single future prediction, while diffusion-based methods typically treat weather variables as undifferentiated conditioning signals, and existing benchmarks focus mainly on reconstruction accuracy rather than whether forecasts respond correctly to changed weather forcing. We introduce EO-WM, a video diffusion transformer for multispectral EO forecasting. EO-WM incorporates a physically informed conditioning framework that represents meteorological forcing through a climatological baseline, weather anomalies, and cumulative physical stress signals. Specifically, it separates baseline and anomaly through distinct conditioning pathways, and accumulates anomalous forcing over time to capture sustained heat and drought stress. To evaluate weather-response behavior beyond standard metrics, we introduce two diagnostic benchmarks: an Extreme Summer Benchmark for severity-aware prediction of vegetation degradation under extreme weather, and a Seasonal Matched-Pair Benchmark for testing response fidelity under changed weather forcing. Experiments show that EO-WM reduces the error in predicted Normalized Difference Vegetation Index (NDVI) decline amplitude by a relative 5.63% and improves directional hit rate by a relative 7.80%, while remaining competitive on standard pixel-level metrics. The benchmarks and model will be made open-source at https://github.com/Luo-Z13/EO-WM.
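Since the headline results are stated in terms of NDVI, a short reminder of how the index is computed may help. NDVI is the standard ratio (NIR − Red) / (NIR + Red) of near-infrared and red reflectance. The function name, the `eps` guard, and the choice to return NaN for pixels with no signal are assumptions for this sketch, not details from the paper.

```python
import math


def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index for one pixel.

    nir, red: surface reflectances in the near-infrared and red bands.
    Returns NaN when the denominator is close to zero (assumed convention).
    """
    denom = nir + red
    if abs(denom) < eps:
        return math.nan
    return (nir - red) / denom


# Hypothetical reflectances: dense vegetation reflects strongly in NIR.
print(ndvi(0.5, 0.1))
print(ndvi(0.2, 0.2))
```

Healthy vegetation gives values well above zero, so a sustained fall in NDVI over a forecast window is what the "decline amplitude" in the abstract refers to.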

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The three-way weather conditioning and new response-focused benchmarks are the actual additions here, with modest reported gains but thin validation details.


The paper's core move is to condition a video diffusion transformer on meteorological forcing split into climatological baseline, weather anomalies, and accumulated anomalous stress, then test the outputs on two new benchmarks that check vegetation response to weather shifts rather than just reconstruction error.

This separation plus the cumulative stress path is the concrete novelty. It gives the model explicit routes for different time scales of forcing, which aligns with how drought or heat actually builds up. The reported 5.63% relative drop in NDVI decline error and 7.80% hit-rate lift on the extreme-summer and matched-pair benchmarks are small but point in the expected direction if the conditioning is doing useful work. Open-sourcing the code and benchmarks is also a plus for anyone who wants to check the setup.

The evaluation is the main soft spot. The abstract gives relative percentages with no error bars, no dataset sizes, no statistical tests, and no ablations that isolate whether the three pathways are necessary versus a simpler conditioning scheme. Because the benchmarks are new, it's hard to know if the gains would survive different metric choices or external datasets. The stress-test point about unobserved land-surface states also lands: nothing in the description compares the model against an otherwise identical version that adds recurrent state buffers or a more detailed process model, so the sufficiency claim rests on the assumption that the three signals are enough.

This is aimed at the EO forecasting and remote-sensing AI crowd who already work with diffusion models and care about weather sensitivity. A reader looking for incremental conditioning tricks or diagnostic benchmarks could extract something usable. It deserves a serious referee because the idea is specific and the benchmarks address a real evaluation gap, even though the paper will need more ablations and statistical grounding to hold up.

I'd send it to review with requests for those details rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces EO-WM, a video diffusion transformer for multispectral Earth Observation forecasting that incorporates a physically informed conditioning framework separating meteorological forcing into climatological baseline, weather anomalies, and cumulative physical stress signals through distinct pathways. It introduces two new diagnostic benchmarks (Extreme Summer Benchmark and Seasonal Matched-Pair Benchmark) to evaluate weather-response behavior beyond standard reconstruction metrics. Experiments report that EO-WM reduces error in predicted NDVI decline amplitude by a relative 5.63% and improves directional hit rate by a relative 7.80% while remaining competitive on pixel-level metrics; the model and benchmarks are to be open-sourced.

Significance. If the central claim holds, the work would be significant for embedding structured physical conditioning into probabilistic EO world models, addressing the gap between undifferentiated weather inputs and observable vegetation response to extremes. The focus on diagnostic benchmarks for directional fidelity and severity-aware prediction, rather than solely pixel-level accuracy, represents a constructive contribution. Open-sourcing the model and benchmarks strengthens reproducibility. However, the significance is limited by the absence of statistical validation and direct tests of the conditioning design's sufficiency.

major comments (2)

[Abstract] The reported relative gains (5.63% reduction in NDVI decline amplitude error; 7.80% improvement in directional hit rate) are presented without error bars, dataset sizes, statistical tests, or ablation details. These omissions are load-bearing for the central claim that the three-way conditioning improves weather-response behavior on the new benchmarks.
[Method and Experiments] No comparison is reported to an otherwise identical architecture augmented with explicit state variables (e.g., recurrent soil-moisture or vegetation-state buffer) or to a version replacing the cumulative stress accumulator with a more detailed process model. This is load-bearing because the gains on the author-introduced benchmarks are attributed to the sufficiency of the baseline/anomaly/stress pathways for capturing unobserved land-surface dynamics.

minor comments (1)

[Experiments] The high-level description of benchmark construction leaves open the possibility that metric choices influence the reported improvements; more explicit details on how the Extreme Summer and Seasonal Matched-Pair benchmarks are constructed would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional statistical detail and justification of design choices would strengthen the manuscript. We respond to each major comment below.


Referee: [Abstract] The reported relative gains (5.63% reduction in NDVI decline amplitude error; 7.80% improvement in directional hit rate) are presented without error bars, dataset sizes, statistical tests, or ablation details. These omissions are load-bearing for the central claim that the three-way conditioning improves weather-response behavior on the new benchmarks.

Authors: We agree that the abstract and results would benefit from explicit statistical support. In the revised manuscript we will report mean performance with standard deviations across multiple random seeds, the exact sizes of the Extreme Summer and Seasonal Matched-Pair benchmarks, and the outcomes of paired statistical tests (e.g., Wilcoxon signed-rank) for the reported relative improvements. revision: yes

Referee: [Method and Experiments] No comparison is reported to an otherwise identical architecture augmented with explicit state variables (e.g., recurrent soil-moisture or vegetation-state buffer) or to a version replacing the cumulative stress accumulator with a more detailed process model. This is load-bearing because the gains on the author-introduced benchmarks are attributed to the sufficiency of the baseline/anomaly/stress pathways for capturing unobserved land-surface dynamics.

Authors: The EO setting is defined by partially observed inputs; explicit recurrent state buffers would require auxiliary variables (soil moisture, vegetation state) that are not part of the multispectral observation stream used in our benchmarks. We will add a dedicated paragraph in the revised Method section clarifying this modeling choice and will include additional pathway-specific ablations that isolate the contribution of the cumulative-stress accumulator to the directional and amplitude metrics. revision: partial
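The statistical support promised in the first response could look roughly like the sketch below. It reports mean and standard deviation across seeds and runs a paired sign-flip permutation test, which is used here as a simple stand-in for the Wilcoxon signed-rank test the authors mention and is not the same procedure. All scores are made up for illustration, and the function names are assumptions.

```python
import random
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_pvalue(a, b, n_resamples=10000, seed=0):
    """Two-sided p-value for 'mean paired difference is zero'.

    Randomly flips the sign of each paired difference and counts how often
    the resampled total is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-sample hit indicators averaged per benchmark region.
new_model = [0.74, 0.71, 0.78, 0.69, 0.75, 0.73, 0.77, 0.72, 0.76, 0.70, 0.79, 0.74]
reference = [0.70, 0.68, 0.73, 0.66, 0.71, 0.70, 0.72, 0.69, 0.71, 0.67, 0.74, 0.70]

print(mean_and_std([0.70, 0.72, 0.74]))
print(paired_sign_flip_pvalue(new_model, reference))
```

Pairing matters here because both models are scored on the same benchmark items, so the test looks at per-item differences rather than at two unrelated score distributions.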

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces a conditioning framework and evaluates empirical performance gains on newly proposed benchmarks using independent metrics (NDVI decline amplitude error, directional hit rate). These outcomes are measured results on held-out or constructed test cases rather than quantities defined in terms of the fitted parameters or model architecture. No self-citations, self-definitional equations, or fitted-input-as-prediction patterns appear in the abstract or described claims. The derivation remains self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review limits visibility; the approach rests on the domain assumption that weather can be usefully decomposed into baseline, anomaly, and cumulative stress signals for conditioning a generative model.

axioms (1)

domain assumption: Weather acts as a conditioning signal while forecasting remains uncertain due to sparse observations and unobserved land-surface states.
Stated in the opening paragraph of the abstract as the problem framing.

invented entities (1)

cumulative physical stress signals (no independent evidence)
purpose: Capture sustained heat and drought stress by accumulating anomalous forcing over time.
Introduced as part of the conditioning framework; no independent falsifiable prediction outside the model is mentioned.



Reference graph

Works this paper leans on

63 extracted references · 9 linked inside Pith

[1]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[2]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024
[3]

Multi-modal learning for geospatial vegetation forecasting

Vitus Benson, Claire Robin, Christian Requena-Mesa, Lazaro Alonso, Nuno Carvalhais, José Cortés, Zhihan Gao, Nora Linscheid, Mélanie Weynants, and Markus Reichstein. Multi-modal learning for geospatial vegetation forecasting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27788–27799, 2024

2024
[4]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024
[5]

Insights from earth system model initial-condition large ensembles and future prospects.Nature climate change, 10 (4):277–286, 2020

Clara Deser, Flavio Lehner, Keith B Rodgers, Toby Ault, Thomas L Delworth, Pedro N DiNezio, Arlene Fiore, Claude Frankignoul, John C Fyfe, Daniel E Horton, et al. Insights from earth system model initial-condition large ensembles and future prospects.Nature climate change, 10 (4):277–286, 2020

2020
[6]

Understand- ing the role of weather data for earth surface forecasting using a convlstm-based model

Codrut,-Andrei Diaconu, Sudipan Saha, Stephan Günnemann, and Xiao Xiang Zhu. Understand- ing the role of weather data for earth surface forecasting using a convlstm-based model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1362–1371, 2022

2022
[7]

Simvp: Simpler yet better video prediction

Zhangyang Gao, Cheng Tan, Lirong Wu, and Stan Z Li. Simvp: Simpler yet better video prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3170–3180, 2022

2022
[8]

Earthformer: Exploring space-time transformers for earth system forecasting.Advances in Neural Information Processing Systems, 35:25390–25403, 2022

Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Yuyang Bernie Wang, Mu Li, and Dit-Yan Yeung. Earthformer: Exploring space-time transformers for earth system forecasting.Advances in Neural Information Processing Systems, 35:25390–25403, 2022

2022
[9]

Ecomapper: Generative modeling for climate-aware satellite imagery

Muhammed Goktepe, Amir hossein Shamseddin, Erencan Uysal, Javier Muinelo Monteagudo, Lukas Drees, Aysim Toker, Senthold Asseng, and Malte V on Bloh. Ecomapper: Generative modeling for climate-aware satellite imagery. InForty-second International Conference on Machine Learning, 2025

2025
[10]

Disentangling physical dynamics from unknown factors for unsupervised video prediction

Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11474–11484, 2020

2020
[11]

World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

Pith/arXiv arXiv 2018
[12]

Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019. 10

Pith/arXiv arXiv 1912
[13]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

2019
[14]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2022

2021
[15]

Diffusion models for video prediction and infilling.Transactions on Machine Learning Research, 2022, 2022

Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling.Transactions on Machine Learning Research, 2022, 2022

2022
[16]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

2022
[17]

Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

arXiv 2025
[18]

Global vegetation modeling with pre-trained weather transformers.arXiv preprint arXiv:2403.18438, 2024

Pascal Janetzky, Florian Gallusser, Simon Hentschel, Andreas Hotho, and Anna Krause. Global vegetation modeling with pre-trained weather transformers.arXiv preprint arXiv:2403.18438, 2024

arXiv 2024
[19]

Diffusionsat: A generative foundation model for satellite imagery

Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David B Lobell, and Stefano Ermon. Diffusionsat: A generative foundation model for satellite imagery. InThe Twelfth International Conference on Learning Representations, 2023

2023
[20]

Enhanced prediction of vegetation responses to extreme drought using deep learning and earth observation data.Ecological Informatics, 80:102474, 2024

Klaus-Rudolf Kladny, Marco Milanta, Oto Mraz, Koen Hufkens, and Benjamin D Stocker. Enhanced prediction of vegetation responses to extreme drought using deep learning and earth observation data.Ecological Informatics, 80:102474, 2024

2024
[21]

Eo-vae: Towards a multi-sensor tokenizer for earth observation data.arXiv preprint arXiv:2602.12177, 2026

Nils Lehmann, Yi Wang, Zhitong Xiong, and Xiaoxiang Zhu. Eo-vae: Towards a multi-sensor tokenizer for earth observation data.arXiv preprint arXiv:2602.12177, 2026

arXiv 2026
[22]

Stiv: Scalable text and image conditioned video generation

Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16249–16259, 2025

2025
[23]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[24]

Vdt: General-purpose video diffusion transformers via mask modeling

Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: General-purpose video diffusion transformers via mask modeling. InThe Twelfth International Conference on Learning Representations, 2023

2023
[25]

Remote sensing-oriented world model.arXiv preprint arXiv:2509.17808, 2025

Yuxi Lu, Biao Wu, Zhidong Li, Kunqi Li, Chenya Huang, Huacan Wang, Qizhen Lan, Ronghao Chen, Ling Chen, and Bin Liang. Remote sensing-oriented world model.arXiv preprint arXiv:2509.17808, 2025

arXiv 2025
[26]

Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

Pith/arXiv arXiv 2024
[27]

Controllable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

Yue Ma, Kunyu Feng, Zhongyuan Hu, Xinyu Wang, Yucheng Wang, Mingzhe Zheng, Bingyuan Wang, Qinghe Wang, Xuanhua He, Hongfa Wang, et al. Controllable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

arXiv 2025
[28]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024. 11

2024
[29]

Syncvp: joint diffusion for synchronous multi-modal video prediction

Enrico Pallotta, Sina Mokhtarzadeh Azar, Shuai Li, Olga Zatsarynna, and Juergen Gall. Syncvp: joint diffusion for synchronous multi-modal video prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13787–13797, 2025

2025
[30]

Explainable earth surface forecasting under extreme events.Earth’s Future, 13(9):e2024EF005446, 2025

Oscar J Pellicer-Valero, Miguel-Ángel Fernández-Torres, Chaonan Ji, Miguel D Mahecha, and Gustau Camps-Valls. Explainable earth surface forecasting under extreme events.Earth’s Future, 13(9):e2024EF005446, 2025

2025
[31]

Film: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[32]

Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task

Christian Requena-Mesa, Vitus Benson, Markus Reichstein, Jakob Runge, and Joachim Denzler. Earthnet2021: A large-scale dataset and challenge for earth surface forecasting as a guided video prediction task. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1132–1142, 2021

2021
[33]

Avid: Adapting video diffusion models to world models.arXiv preprint arXiv:2410.12822, 2024

Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma. Avid: Adapting video diffusion models to world models.arXiv preprint arXiv:2410.12822, 2024

arXiv 2024
[34]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

arXiv 2026
[35]

Vit-koop: Vision-transformer-koopman operators for efficient time-series forecasting of earth-observation data

Takayuki Shinohara. Vit-koop: Vision-transformer-koopman operators for efficient time-series forecasting of earth-observation data. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 2835–2844, October 2025

2025
[36]

Restore-dit: Reliable satellite image time series reconstruction by multimodal sequential diffusion transformer.Remote Sensing of Environment, 328:114872, 2025

Qidi Shu, Xiaolin Zhu, Shuai Xu, Yan Wang, and Denghong Liu. Restore-dit: Reliable satellite image time series reconstruction by multimodal sequential diffusion transformer.Remote Sensing of Environment, 328:114872, 2025

2025
[37]

Earthpt: a foundation model for earth observation.European Geosciences Union General Assembly 2024 (EGU24), page 1760, 2024

Michael Smith, Luke Fleming, and James Geach. Earthpt: a foundation model for earth observation.European Geosciences Union General Assembly 2024 (EGU24), page 1760, 2024

2024
[38]

Diffobs: Generative diffusion for global forecasting of satellite observations.arXiv preprint arXiv:2404.06517, 2024

Jason Stock, Jaideep Pathak, Yair Cohen, Mike Pritchard, Piyush Garg, Dale Durran, Morteza Mardani, and Noah Brenowitz. Diffobs: Generative diffusion for global forecasting of satellite observations.arXiv preprint arXiv:2404.06517, 2024

arXiv 2024
[39]

Temporal attention unit: Towards efficient spatiotemporal predictive learning

Cheng Tan, Zhangyang Gao, Lirong Wu, Yongjie Xu, Jun Xia, Siyuan Li, and Stan Z Li. Temporal attention unit: Towards efficient spatiotemporal predictive learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18770–18782, 2023

2023
[40]

Openstl: A comprehensive benchmark of spatio-temporal predictive learning

Cheng Tan, Siyuan Li, Zhangyang Gao, Wenfei Guan, Zedong Wang, Zicheng Liu, Lirong Wu, and Stan Z Li. Openstl: A comprehensive benchmark of spatio-temporal predictive learning. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023

2023
[41]

Advancing open-source world models

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, et al. Advancing open-source world models. arXiv preprint arXiv:2601.20540, 2026

Pith/arXiv arXiv 2026
[42]

Forecasting dryland vegetation condition months in advance through satellite data assimilation.Nature Communica- tions, 10(1):469, 2019

Siyuan Tian, Albert IJM Van Dijk, Paul Tregoning, and Luigi J Renzullo. Forecasting dryland vegetation condition months in advance through satellite data assimilation.Nature Communica- tions, 10(1):469, 2019

2019
[43]

A control-centric benchmark for video prediction

Stephen Tian, Chelsea Finn, and Jiajun Wu. A control-centric benchmark for video prediction. InInternational Conference on Learning Representations, 2023

2023
[44]

Attribution of climate extreme events.Nature climate change, 5(8):725–730, 2015

Kevin E Trenberth, John T Fasullo, and Theodore G Shepherd. Attribution of climate extreme events.Nature climate change, 5(8):725–730, 2015

2015
[45]

Crop yield prediction using machine learning: A systematic literature review.Computers and electronics in agriculture, 177:105709, 2020

Thomas Van Klompenburg, Ayalew Kassahun, and Cagatay Catal. Crop yield prediction using machine learning: A systematic literature review.Computers and electronics in agriculture, 177:105709, 2020. 12

2020
[46] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems, 35:23371–23385, 2022.

[47] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[48] Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, and Chongyang Ma. Ati: Any trajectory instruction for controllable video generation. arXiv preprint arXiv:2505.22944, 2025.

[49] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Advances in neural information processing systems, 30, 2017.

[50] Yunbo Wang, Haixu Wu, Jianjin Zhang, Zhifeng Gao, Jianmin Wang, Philip S Yu, and Mingsheng Long. Predrnn: A recurrent neural network for spatiotemporal predictive learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2208–2225, 2022.

[51] Linrui Xu, Zhongan Wang, Fei Shen, Gang Xu, Huiping Zhuang, Ming Li, and Haifeng Li. Rs-worldmodel: a unified model for remote sensing understanding and future sense forecasting. arXiv preprint arXiv:2603.14941, 2026.

[52] Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. Worldmark: A unified benchmark suite for interactive video world models. arXiv preprint arXiv:2604.21686, 2026.

[53] Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao, Siyi Peng, Bailan Feng, Xiang Bai, and Hengshuang Zhao. Geniedrive: Towards physics-aware driving world model with 4d occupancy guided video generation. arXiv preprint arXiv:2512.12751, 2025.

[54] Xi Ye and Guillaume-Alexandre Bilodeau. Stdiff: Spatio-temporal diffusion for continuous stochastic video prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6666–6674, 2024.

[55] Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, et al. Units: Unified time series generative model for remote sensing. arXiv preprint arXiv:2512.04461, 2025.

[56] Zhicheng Zhang, Junyao Hu, Wentao Cheng, Danda Paudel, and Jufeng Yang. Extdm: Distribution extrapolation diffusion model for video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19310–19320, 2024.

[57] Sijie Zhao, Hao Chen, Xueliang Zhang, Pengfeng Xiao, and Lei Bai. Vegediff: Latent diffusion model for geospatial vegetation forecasting. IEEE Transactions on Geoscience and Remote Sensing, 2025.

[58] Zangwei Zheng, Xiangyu Peng, Yuxuan Lou, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv preprint arXiv:2503.09642, 2025.

[59] Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, and Yanfei Zhong. Changen2: Multi-temporal remote sensing generative change foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(2):725–741, 2024.

A Technical appendices and supplementary material

We organize our supplementary material as follows. Section A.1 provide...

[60]

Within each track, we rank pairs by their respective divergence score (descending)
[61]

Per seasonal phase (3 phases: offsets 0, 20, 40), we select the top 50 highest-divergence pairs, yielding 150 pairs per track
[62]

A per-cube cap of 3 pairs prevents any single location from dominating the benchmark
[63]

The three track selections are merged via union and deduplicated by pair identity. Each pair retains metadata indicating which track(s) selected it and its rank within each track. The final benchmark contains 422 unique pairs (844 inference windows). Multi-track membership provides complementary evaluation perspectives: 394 pairs belong to a single paper-fa...
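
The selection steps described above (rank by divergence, keep the top pairs per seasonal phase, cap the pairs per cube, then merge tracks by union) could be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the function names `select_track` and `merge_tracks`, the dictionary fields `id`, `cube`, and `phase`, and the score keys are all hypothetical. It also assumes that the per-cube cap is enforced greedily while walking the ranked list and is counted across all phases of a track, which the text does not specify.

```python
from collections import defaultdict


def select_track(pairs, score_key, phases=(0, 20, 40), top_k=50, cube_cap=3):
    """Select up to top_k pairs per seasonal phase for one track.

    Assumption: the per-cube cap is applied greedily in rank order and
    is shared across the phases of the track.
    """
    per_cube = defaultdict(int)  # number of pairs already taken per cube
    selected = []
    for phase in phases:
        # Rank this phase's pairs by divergence score, highest first.
        ranked = sorted(
            (p for p in pairs if p["phase"] == phase),
            key=lambda p: p[score_key],
            reverse=True,
        )
        kept = 0
        for pair in ranked:
            if kept == top_k:
                break
            if per_cube[pair["cube"]] >= cube_cap:
                continue  # this location already reached its cap
            per_cube[pair["cube"]] += 1
            kept += 1
            # Store the pair id with its rank among the kept pairs
            # of this phase.
            selected.append((pair["id"], kept))
    return selected


def merge_tracks(track_selections):
    """Union the track selections, deduplicating by pair id.

    Returns a dict mapping pair id -> {track name: rank in that track}.
    """
    merged = {}
    for track, selection in track_selections.items():
        for pair_id, rank in selection:
            merged.setdefault(pair_id, {})[track] = rank
    return merged
```

With the defaults, each track contributes at most 3 × 50 = 150 pairs, and a pair chosen by several tracks appears once in the merged result with one rank entry per track.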
