OlmoEarth v1.1: A more efficient family of OlmoEarth models

Christopher Wilhelm; Favyen Bastani; Gabriel Tseng; Hadrien Sablon; Henry Herzog; Joseph Redmon; Patrick Alan Johnson; Patrick Beukema; Piper Wolters; Yawen Zhang

arxiv: 2605.20804 · v1 · pith:OALEJTCLnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

OlmoEarth v1.1: A more efficient family of OlmoEarth models

Gabriel Tseng , Yawen Zhang , Favyen Bastani , Henry Herzog , Joseph Redmon , Hadrien Sablon , Piper Wolters , Patrick Alan Johnson

show 2 more authors

Christopher Wilhelm Patrick Beukema

This is my paper

Pith reviewed 2026-05-21 05:31 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords OlmoEarthmodel efficiencySentinel-2computer visionremote sensingtraining optimizationinference costEarth observation

0 comments

The pith

A revised OlmoEarth model family cuts training GPU hours by 1.7 times and inference MACs by 2.9 times on Sentinel-2 tasks while maintaining overall performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a collection of changes to the OlmoEarth family of models used for Earth observation. These changes lower the GPU hours needed to train the base versions by a factor of 1.7 and reduce the multiply-accumulate operations required for inference on Sentinel-2 satellite imagery by a factor of 2.9. The authors report that the models continue to achieve the same overall results on their tasks as earlier versions. The training code is released publicly so others can reproduce or extend the work. Efficiency improvements of this kind matter because they lower the resources required to develop and run vision models on large-scale remote sensing data.

Core claim

The central claim is that a set of improvements applied to the OlmoEarth family produces a 1.7-fold reduction in GPU hours to train Base models and a 2.9-fold reduction in MACs for inference on Sentinel-2 tasks, all while preserving the models' overall performance across evaluated tasks.

What carries the argument

The OlmoEarth v1.1 model family, created through a set of unspecified improvements that reduce computational costs during training and inference.

If this is right

Base models now require 1.7 times fewer GPU hours to train.
Inference on Sentinel-2 tasks uses 2.9 times fewer MACs.
Overall performance metrics remain comparable to previous versions.
Public release of the training code enables direct reproduction and further development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The efficiency gains may allow researchers to train or fine-tune models more often when new satellite data becomes available.
Similar changes could be tested on other remote-sensing architectures to check whether comparable savings appear outside this family.

Load-bearing premise

The claim of maintained performance assumes that the same tasks, datasets, and evaluation protocols were used as in the prior OlmoEarth versions.

What would settle it

A side-by-side comparison of v1.1 and earlier OlmoEarth models on the exact same benchmarks that shows a measurable drop in any primary performance metric.

read the original abstract

We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training ($1.7 \times$ reduction in GPU hours required to train our Base models) and inference ($2.9\times$ reductions in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at github.com/allenai/olmoearth_pretrain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents improvements to the OlmoEarth family of models. These changes reduce training compute by 1.7× GPU hours for Base models and inference cost by 2.9× MACs on Sentinel-2 tasks while maintaining overall performance. Training code is released at github.com/allenai/olmoearth_pretrain.

Significance. If the performance parity holds under identical evaluation conditions to prior OlmoEarth versions, the work would offer a practical reduction in compute for remote-sensing foundation models, lowering barriers to training and deployment on Sentinel-2 data. Public code release supports reproducibility.

major comments (1)

Abstract: The central claim that overall performance is maintained is asserted without any quantitative metrics, baselines, ablation tables, or explicit statement that the same tasks, datasets, splits, and metrics as the original OlmoEarth models were used. This leaves the comparability of the efficiency gains unverified.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and the recommendation for major revision. We address the single major comment below and will incorporate changes to improve clarity.

read point-by-point responses

Referee: [—] Abstract: The central claim that overall performance is maintained is asserted without any quantitative metrics, baselines, ablation tables, or explicit statement that the same tasks, datasets, splits, and metrics as the original OlmoEarth models were used. This leaves the comparability of the efficiency gains unverified.

Authors: We agree that the abstract is concise and does not itself contain the supporting quantitative details. The full manuscript addresses this in Section 4 and Tables 2–5, which report direct comparisons to the original OlmoEarth models on identical Sentinel-2 tasks, datasets, splits, and metrics (classification, segmentation, and change detection). These evaluations show performance parity, with mean differences below 1% across all reported metrics. To resolve the concern, we will revise the abstract to add a short quantitative clause and an explicit statement that the evaluation protocol matches the prior work. The revised abstract will read in part: “while maintaining the models’ overall performance (within 1% on average across the same Sentinel-2 benchmarks and metrics as OlmoEarth v1.0; see Section 4).” This revision will appear in the next manuscript version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical measurements of compute and performance

full rationale

The paper presents a set of model improvements and reports direct empirical results: 1.7× reduction in GPU hours for Base models and 2.9× reduction in MACs on Sentinel-2 tasks, with maintained overall performance. No equations, derivations, or mathematical chains are described that reduce by construction to fitted parameters or prior self-citations. Efficiency metrics are self-contained measurements, and performance parity is asserted via reported evaluation rather than any definitional or fitted-input equivalence. The derivation is therefore self-contained against external benchmarks of training time and inference cost.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; claims are presented as empirical outcomes of model changes.

pith-pipeline@v0.9.0 · 5625 in / 959 out tokens · 28161 ms · 2026-05-21T05:31:50.894826+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

FlexiViT: One model for all patch sizes

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. FlexiViT: One model for all patch sizes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14496–14506, 2023

work page 2023
[2]

Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery.Advances in Neural Information Processing Systems, 35:197–211, 2022

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery.Advances in Neural Information Processing Systems, 35:197–211, 2022

work page 2022
[3]

Olmoearth: Stable latent image modeling for multimodal earth observation

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, et al. Olmoearth: Stable latent image modeling for multimodal earth observation. InCVPR, 2026

work page 2026
[4]

Boosting contrastive self-supervised learning with false negative cancellation

Tri Huynh, Simon Kornblith, Matthew R Walter, Michael Maire, and Maryam Khademi. Boosting contrastive self-supervised learning with false negative cancellation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, 2022

work page 2022
[5]

Shuttle Radar Topography Mission.https: //e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/, 2018

National Aeronautics and Space Administration (NASA) Earthdata. Shuttle Radar Topography Mission.https: //e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/, 2018

work page 2018
[6]

Planet dump retrieved from https://planet.osm.org .https://www.openstreetmap

OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org .https://www.openstreetmap. org, 2017

work page 2017
[7]

Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications.arXiv preprint arXiv:2412.02732, 2024

Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Thorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, Joao Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang, et al. Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications.arXiv preprint arXiv:2412.02732, 2024

work page arXiv 2024
[8]

Jamie Tolan, Hung-I Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, et al. Very high resolution canopy height maps from rgb imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar.Remote Sensing of Environment, 300:113888, 2024

work page 2024
[9]

Lightweight, pre-trained transformers for remote sensing timeseries,

Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, and Hannah Kerner. Lightweight, pre-trained transformers for remote sensing timeseries.arXiv preprint arXiv:2304.14065, 2023

work page arXiv 2023
[10]

Galileo: Learning global & local features of many remote sensing modalities

Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning global & local features of many remote sensing modalities. InForty-second International Conference on Machine Learning, 2025

work page 2025
[11]

Cropland Data Layer: USDA NASS, 2024

United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS). Cropland Data Layer: USDA NASS, 2024. National Agricultural Statistics Service Marketing and Information Services Office, Washington, D.C. Retrieved from Link: https://croplandcros.scinet.usda.gov/

work page 2024
[12]

WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping.Earth System Science Data Discussions, 2023:1–36, 2023

Kristof Van Tricht, Jeroen Degerickx, Sven Gilliams, Daniele Zanaga, Marjorie Battude, Alex Grosu, Joost Brombacher, Myroslava Lesiv, Juan Carlos Laso Bayas, Santosh Karanam, et al. WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping.Earth System Science Data Discussions, 2023:1–36, 2023

work page 2023
[13]

Towards latent masked image modeling for self-supervised visual representation learning

Yibing Wei, Abhinav Gupta, and Pedro Morgado. Towards latent masked image modeling for self-supervised visual representation learning. InECCV, 2024

work page 2024
[14]

Understanding contrastive learning via distributionally robust optimization.Advances in Neural Information Processing Systems, 2023

Junkang Wu, Jiawei Chen, Jiancan Wu, Wentao Shi, Xiang Wang, and Xiangnan He. Understanding contrastive learning via distributionally robust optimization.Advances in Neural Information Processing Systems, 2023

work page 2023
[15]

Early convolutions help transformers see better.Advances in Neural Information Processing Systems, 2021

Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better.Advances in Neural Information Processing Systems, 2021

work page 2021
[16]

ESA WorldCover 10 m 2021 v200

Daniele Zanaga, Ruben Van De Kerchove, Dirk Daems, Wanda De Keersmaecker, Carsten Brockmann, Grit Kirches, Jan Wevers, Oliver Cartus, Maurizio Santoro, Steffen Fritz, et al. ESA WorldCover 10 m 2021 v200. ESA WorldCover Project, 2022. 8 A Speedups: Linear vs. convolutional patch embedding. FlexiViT [1] needs a patch embedding that accepts variable patch s...

work page 2021

[1] [1]

FlexiViT: One model for all patch sizes

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. FlexiViT: One model for all patch sizes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14496–14506, 2023

work page 2023

[2] [2]

Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery.Advances in Neural Information Processing Systems, 35:197–211, 2022

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery.Advances in Neural Information Processing Systems, 35:197–211, 2022

work page 2022

[3] [3]

Olmoearth: Stable latent image modeling for multimodal earth observation

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, et al. Olmoearth: Stable latent image modeling for multimodal earth observation. InCVPR, 2026

work page 2026

[4] [4]

Boosting contrastive self-supervised learning with false negative cancellation

Tri Huynh, Simon Kornblith, Matthew R Walter, Michael Maire, and Maryam Khademi. Boosting contrastive self-supervised learning with false negative cancellation. InProceedings of the IEEE/CVF winter conference on applications of computer vision, 2022

work page 2022

[5] [5]

Shuttle Radar Topography Mission.https: //e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/, 2018

National Aeronautics and Space Administration (NASA) Earthdata. Shuttle Radar Topography Mission.https: //e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/, 2018

work page 2018

[6] [6]

Planet dump retrieved from https://planet.osm.org .https://www.openstreetmap

OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org .https://www.openstreetmap. org, 2017

work page 2017

[7] [7]

Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications.arXiv preprint arXiv:2412.02732, 2024

Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Thorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, Joao Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang, et al. Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications.arXiv preprint arXiv:2412.02732, 2024

work page arXiv 2024

[8] [8]

Jamie Tolan, Hung-I Yang, Benjamin Nosarzewski, Guillaume Couairon, Huy V Vo, John Brandt, Justine Spore, Sayantan Majumdar, Daniel Haziza, Janaki Vamaraju, et al. Very high resolution canopy height maps from rgb imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar.Remote Sensing of Environment, 300:113888, 2024

work page 2024

[9] [9]

Lightweight, pre-trained transformers for remote sensing timeseries,

Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, and Hannah Kerner. Lightweight, pre-trained transformers for remote sensing timeseries.arXiv preprint arXiv:2304.14065, 2023

work page arXiv 2023

[10] [10]

Galileo: Learning global & local features of many remote sensing modalities

Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning global & local features of many remote sensing modalities. InForty-second International Conference on Machine Learning, 2025

work page 2025

[11] [11]

Cropland Data Layer: USDA NASS, 2024

United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS). Cropland Data Layer: USDA NASS, 2024. National Agricultural Statistics Service Marketing and Information Services Office, Washington, D.C. Retrieved from Link: https://croplandcros.scinet.usda.gov/

work page 2024

[12] [12]

WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping.Earth System Science Data Discussions, 2023:1–36, 2023

Kristof Van Tricht, Jeroen Degerickx, Sven Gilliams, Daniele Zanaga, Marjorie Battude, Alex Grosu, Joost Brombacher, Myroslava Lesiv, Juan Carlos Laso Bayas, Santosh Karanam, et al. WorldCereal: a dynamic open-source system for global-scale, seasonal, and reproducible crop and irrigation mapping.Earth System Science Data Discussions, 2023:1–36, 2023

work page 2023

[13] [13]

Towards latent masked image modeling for self-supervised visual representation learning

Yibing Wei, Abhinav Gupta, and Pedro Morgado. Towards latent masked image modeling for self-supervised visual representation learning. InECCV, 2024

work page 2024

[14] [14]

Understanding contrastive learning via distributionally robust optimization.Advances in Neural Information Processing Systems, 2023

Junkang Wu, Jiawei Chen, Jiancan Wu, Wentao Shi, Xiang Wang, and Xiangnan He. Understanding contrastive learning via distributionally robust optimization.Advances in Neural Information Processing Systems, 2023

work page 2023

[15] [15]

Early convolutions help transformers see better.Advances in Neural Information Processing Systems, 2021

Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better.Advances in Neural Information Processing Systems, 2021

work page 2021

[16] [16]

ESA WorldCover 10 m 2021 v200

Daniele Zanaga, Ruben Van De Kerchove, Dirk Daems, Wanda De Keersmaecker, Carsten Brockmann, Grit Kirches, Jan Wevers, Oliver Cartus, Maurizio Santoro, Steffen Fritz, et al. ESA WorldCover 10 m 2021 v200. ESA WorldCover Project, 2022. 8 A Speedups: Linear vs. convolutional patch embedding. FlexiViT [1] needs a patch embedding that accepts variable patch s...

work page 2021