TerraMind: Large-Scale Generative Multimodality for Earth Observation

Benedikt Blumenstiel; Erik Scheurer; Felix Yang; Gabriele Cavallaro; Jente Bosmans; Johannes Jakubik; Juan Bernabe-Moreno; Nicolas Longépé; Niklas Kopp; Nikolaos Dionelis

arxiv: 2504.11171 · v5 · pith:ERK5YTZHnew · submitted 2025-04-15 · 💻 cs.CV · cs.AI

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

classification 💻 cs.CV cs.AI

keywords terramind, earth, model, observation, data, dataset, dual-scale, generative

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
cs.LG 2026-05 unverdicted novelty 7.0

LUCAS-MEGA fuses 68 soil-environment datasets into a 70k-sample multimodal resource that supports self-supervised pretraining of SoilFormer, whose representations align with known soil processes.

LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
cs.LG 2026-05 unverdicted novelty 7.0

LUCAS-MEGA fuses 68 heterogeneous soil datasets into a 70k-sample multimodal collection and demonstrates its value by pretraining a tabular transformer whose representations align with established soil processes.

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data
cs.CV 2026-03 unverdicted novelty 7.0

COP-GEN models multimodal Copernicus Earth observation data as conditional distributions via a latent diffusion transformer, producing diverse physically consistent outputs and covering 90% of the real observation man...

Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation
cs.CV 2026-01 unverdicted novelty 7.0

A new spatial affinity component for self-supervised pretraining leverages high-resolution imagery to enhance mid-resolution satellite image representations and segmentation performance.

Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data
cs.CV 2026-04 unverdicted novelty 5.0

LIANet encodes multi-temporal Earth observation data into a coordinate-based neural field that supports label-only fine-tuning for downstream tasks without access to raw imagery.

Feature Extraction in the Remote Sensing Data Value Chain: A Systematic Review of Methods and Applications
cs.CV 2025-10 unverdicted novelty 5.0

A systematic review that introduces a framework for feature extraction in remote sensing, traces its evolution in the data value chain, and synthesizes trends toward unified representations and foundation models.