TerraMind: Large-Scale Generative Multimodality for Earth Observation
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
Forward citations
Cited by 6 Pith papers
- LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
  LUCAS-MEGA fuses 68 soil-environment datasets into a 70k-sample multimodal resource that supports self-supervised pretraining of SoilFormer, whose representations align with known soil processes.
- LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
  LUCAS-MEGA fuses 68 heterogeneous soil datasets into a 70k-sample multimodal collection and demonstrates its value by pretraining a tabular transformer whose representations align with established soil processes.
- COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data
  COP-GEN models multimodal Copernicus Earth observation data as conditional distributions via a latent diffusion transformer, producing diverse physically consistent outputs and covering 90% of the real observation man...
- Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation
  A new spatial affinity component for self-supervised pretraining leverages high-resolution imagery to enhance mid-resolution satellite image representations and segmentation performance.
- Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data
  LIANet encodes multi-temporal Earth observation data into a coordinate-based neural field that supports label-only fine-tuning for downstream tasks without access to raw imagery.
- Feature Extraction in the Remote Sensing Data Value Chain: A Systematic Review of Methods and Applications
  A systematic review that introduces a framework for feature extraction in remote sensing, traces its evolution in the data value chain, and synthesizes trends toward unified representations and foundation models.