TerraMind: Large-scale generative multimodality for Earth observation

2025 · cs.CV · arXiv 2504.11171

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling, in an application to tropical cyclones

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

MotifGen is the first multi-source generative model for spatiotemporal interpolation of misaligned microwave cyclone images from heterogeneous instruments at irregular intervals, achieving lower CRPS via self-supervised training and closer power spectra than deterministic baselines when combining in

LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

LUCAS-MEGA fuses 68 heterogeneous soil datasets into a 70k-sample multimodal collection and demonstrates its value by pretraining a tabular transformer whose representations align with established soil processes.

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

COP-GEN models multimodal Copernicus Earth observation data as conditional distributions via a latent diffusion transformer, producing diverse physically consistent outputs and covering 90% of the real observation manifold on a new stochastic benchmark.

Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

cs.CV · 2026-01-19 · unverdicted · novelty 7.0

A new spatial affinity component for self-supervised pretraining leverages high-resolution imagery to enhance mid-resolution satellite image representations and segmentation performance.

Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data

cs.CV · 2026-04-08 · unverdicted · novelty 5.0

LIANet encodes multi-temporal Earth observation data into a coordinate-based neural field that supports label-only fine-tuning for downstream tasks without access to raw imagery.

Feature Extraction in the Remote Sensing Data Value Chain: A Systematic Review of Methods and Applications

cs.CV · 2025-10-21 · unverdicted · novelty 5.0

A systematic review that introduces a framework for feature extraction in remote sensing, traces its evolution in the data value chain, and synthesizes trends toward unified representations and foundation models.

citing papers explorer

Showing 6 of 6 citing papers.

MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling, in an application to tropical cyclones

cs.CV · 2026-06-23 · unverdicted · none · ref 14 · internal anchor

LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems

cs.LG · 2026-05-05 · unverdicted · none · ref 13 · 2 links · internal anchor

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

cs.CV · 2026-03-03 · unverdicted · none · ref 24 · internal anchor

Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

cs.CV · 2026-01-19 · unverdicted · none · ref 9 · internal anchor

Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data

cs.CV · 2026-04-08 · unverdicted · none · ref 22 · internal anchor

Feature Extraction in the Remote Sensing Data Value Chain: A Systematic Review of Methods and Applications

cs.CV · 2025-10-21 · unverdicted · none · ref 230 · internal anchor
