TerraMind: Large-scale generative multimodality for Earth observation

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

citation-role summary

background 1

citation-polarity summary

fields

cs.CV 5 · cs.LG 1

years

2026 5 · 2025 1

verdicts

UNVERDICTED 6

roles

background 1

polarities

background 1

representative citing papers

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

COP-GEN models multimodal Copernicus Earth observation data as conditional distributions via a latent diffusion transformer, producing diverse physically consistent outputs and covering 90% of the real observation manifold on a new stochastic benchmark.
