pith. machine review for the scientific record.

arxiv: 2604.10347 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords Scale-ALiBi · multi-scale satellite imagery · multi-modal representation learning · transformer attention bias · GEO-Bench · vision foundation models · SAR and optical fusion

The pith

Scale-ALiBi adds a spatial bias to transformer attention that directly encodes ground sample distance ratios between patches from different resolutions and sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard transformer attention fails to handle satellite imagery because patches from high-resolution optical, low-resolution optical, and low-resolution SAR images sit at incompatible ground sample distances. Scale-ALiBi extends the ALiBi bias with an explicit term that scales the attention offset according to the ratio of those distances, allowing a single model to learn joint representations. The authors train this inside a triple-contrastive plus reconstructive architecture on aligned multi-modal patches and report gains on the GEO-Bench benchmark. Because the change is only a bias addition, it requires no new layers or positional embeddings. The work also releases the aligned dataset used for the experiments.

Core claim

Scale-ALiBi is a linear bias added to self-attention whose magnitude between any two patches is set by the logarithm of the ratio of their ground sample distances; when this bias is used inside a vision transformer trained with contrastive and reconstruction objectives on aligned high- and low-resolution optical and SAR imagery, the resulting representations improve downstream performance on GEO-Bench.
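
Read from the abstract alone, the exact functional form of the bias is not specified beyond the log-ratio of ground sample distances. The sketch below is therefore an editorial illustration, not the authors' implementation: the combination of a spatial-distance term with a |log| GSD-ratio term, the slope value, and the function name are assumptions made for concreteness.

    import numpy as np

    def scale_alibi_bias(coords, gsd, slope=0.5):
        """Illustrative sketch of a Scale-ALiBi-style attention bias (not the authors' code).

        coords : (N, 2) patch-centre positions in ground units (e.g. metres)
        gsd    : (N,)  ground sample distance of each patch (metres per pixel)
        slope  : per-head ALiBi slope; real ALiBi assigns one slope per attention head

        Standard ALiBi subtracts slope * distance from the attention logits; the assumed
        extension adds |log(gsd_i / gsd_j)| so that patch pairs drawn from different
        resolutions receive a scale-aware offset that vanishes when both patches share
        the same ground sample distance.
        """
        # pairwise spatial distance between patch centres (ground units)
        spatial = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        # pairwise |log| of the ground-sample-distance ratio (zero within one modality)
        scale = np.abs(np.log(gsd[:, None] / gsd[None, :]))
        # negative linear bias, added to the attention logits before the softmax
        return -slope * (spatial + scale)

    # e.g. two low-resolution patches (10 m GSD) and two high-resolution ones (2.5 m GSD)
    coords = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 0.0], [10.0, 0.0]])
    gsd = np.array([10.0, 10.0, 2.5, 2.5])
    bias = scale_alibi_bias(coords, gsd)  # (4, 4) matrix of attention offsets

Within one modality the ratio term vanishes, so the sketch reduces to an ordinary distance-based ALiBi penalty, which is the behaviour the core claim appears to require.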

What carries the argument

Scale-ALiBi, the attention bias that injects the log-ratio of ground sample distances between patches drawn from inputs of different spatial resolutions.

If this is right

  • A single transformer can now ingest and align high-resolution optical, low-resolution optical, and low-resolution SAR patches without separate scale-specific branches.
  • The same attention bias can be dropped into any existing ALiBi-equipped vision transformer with no change to architecture or training schedule; a minimal sketch of that drop-in follows this list.
  • The released aligned multi-modal satellite dataset becomes a public test bed for other multi-scale methods.
  • Downstream tasks that require fusion of different-resolution or different-sensor imagery receive stronger starting representations.
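
On the second bullet, the drop-in property follows from where ALiBi-family biases live: they are added to the pre-softmax attention scores and touch nothing else. A minimal single-head sketch under that assumption (the helper names are illustrative, not from the paper):

    import numpy as np

    def softmax(x, axis=-1):
        z = x - x.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_with_bias(q, k, v, bias):
        """Single-head scaled dot-product attention with an additive logit bias.

        q, k, v : (N, d) arrays for one head
        bias    : (N, N) matrix, e.g. the scale_alibi_bias sketch above

        The bias is only added to the pre-softmax scores, which is why an
        ALiBi-style modification needs no new layers or positional embeddings.
        """
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d) + bias  # additive bias on the logits
        return softmax(scores, axis=-1) @ v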

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle of baking physical scale ratios into attention could be applied to temporal spacing in video or to depth spacing in stereo or lidar data.
  • If the bias works because it reflects real-world geometry, analogous biases might be written for other measurable image properties such as viewing angle or atmospheric path length.
  • Because the modification is parameter-free once the scale ratios are known, it offers a route to embed domain knowledge without increasing model size.

Load-bearing premise

That the log-ratio of ground sample distances between patches supplies the right inductive bias for cross-scale relationships, and that this bias remains useful outside the specific aligned training set and loss combination used here.

What would settle it

Training identical models on the same aligned dataset but with the scale-ratio term removed or replaced by a constant: no drop (or an increase) in GEO-Bench scores would undercut the claim, while a clear drop would support it.
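
One way such a controlled comparison could be wired, reusing the hypothetical bias sketch above; only the bias construction changes between runs, and the variant names are editorial shorthand rather than the authors' terminology:

    import numpy as np

    def bias_variant(coords, gsd, variant, slope=0.5, const=1.0):
        """Build the attention-bias variants needed for the ablation.

        'scale_alibi' : spatial term plus |log| GSD-ratio term (the proposal)
        'constant'    : the scale term replaced by a fixed constant
        'alibi'       : the scale term removed, spatial distance only
        Data, losses, and architecture stay identical across the runs.
        """
        spatial = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        if variant == "scale_alibi":
            scale = np.abs(np.log(gsd[:, None] / gsd[None, :]))
        elif variant == "constant":
            scale = np.full((len(gsd), len(gsd)), const)
        elif variant == "alibi":
            scale = np.zeros((len(gsd), len(gsd)))
        else:
            raise ValueError(f"unknown variant: {variant}")
        return -slope * (spatial + scale)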

Figures

Figures reproduced from arXiv: 2604.10347 by Patrick Kage, Pavlos Andreadis.

Figure 1. A block diagram of the full Scale-ALiBi model.
Figure 2. An example Scale-ALiBi attention matrix for 4 patches of size 4 computed from a 4 × 4 source image s, with an 8 × 8 context image c containing 16 patches. s and c represent the same physical area on the ground, and thus this matrix functions as a distance lookup table comparing these two token streams. Note that here the slopes for the different attention heads were omitted for clarity.
Figure 3. A selection of samples from the Scale…
Original abstract

Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks, however, creating models which operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism with a spatial encoding bias to relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Scale-ALiBi, a linear bias extension to ALiBi attention that adds a spatial encoding term for cross-scale patch relationships in multi-resolution imagery. It describes an implementation within a triple-contrastive plus reconstructive transformer architecture operating on aligned high-resolution optical, low-resolution optical, and low-resolution SAR satellite data, reports an improvement on the GEO-Bench benchmark, and releases the curated aligned dataset.

Significance. If the central claim is substantiated, the result would indicate that a targeted, parameter-light modification to the attention bias can enable effective multi-scale and multi-modal representation learning for satellite imagery, potentially simplifying foundation-model design in remote sensing. The public release of the aligned multi-modal dataset is a clear positive contribution that benefits the community regardless of the architectural novelty.

major comments (1)
  1. The experimental section provides no ablation studies that hold the triple-contrastive and reconstructive objectives, data alignment, and overall architecture fixed while swapping only the positional bias (Scale-ALiBi versus standard ALiBi versus no bias or sinusoidal encodings). Without this isolation, any reported GEO-Bench gain cannot be attributed to the proposed scale-aware bias rather than the training losses or the aligned multi-resolution data, leaving the title claim unsupported.
minor comments (1)
  1. The abstract states that an improvement on GEO-Bench is shown but supplies no numerical values, baseline comparisons, or error bars; these details should be added to the abstract for a self-contained summary.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for highlighting the value of the released aligned dataset. We address the single major comment below.

Point-by-point responses
  1. Referee: The experimental section provides no ablation studies that hold the triple-contrastive and reconstructive objectives, data alignment, and overall architecture fixed while swapping only the positional bias (Scale-ALiBi versus standard ALiBi versus no bias or sinusoidal encodings). Without this isolation, any reported GEO-Bench gain cannot be attributed to the proposed scale-aware bias rather than the training losses or the aligned multi-resolution data, leaving the title claim unsupported.

    Authors: We agree that the current experiments do not isolate the positional bias while holding the triple-contrastive and reconstructive objectives, data alignment, and architecture fixed. The reported GEO-Bench results therefore cannot be attributed solely to Scale-ALiBi. In the revised manuscript we will add the requested ablation studies: we will train identical models that differ only in the attention bias (Scale-ALiBi, standard ALiBi, sinusoidal encodings, and no bias) and report the corresponding GEO-Bench scores. This will directly substantiate the title claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architectural proposal with no self-referential derivations

Full rationale

The paper introduces Scale-ALiBi as a new linear bias mechanism for multi-scale patch relationships in transformers and evaluates it empirically on aligned multi-modal satellite data with a triple-contrastive/reconstructive objective, reporting GEO-Bench gains. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The contribution is an architectural addition whose validity rests on external benchmark results rather than reducing to its own inputs by construction. This matches the default non-circular case for papers that present and test a new component without tautological reasoning chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; the central claim rests on the assumption that a simple additive bias term suffices to encode cross-scale spatial relationships without further architectural or loss-function changes.

axioms (1)
  • domain assumption: A linear bias term can be extended with a spatial ground-sample-distance encoding to improve attention across image scales.
    Invoked when the title asserts that multi-modal, multi-scale representation learning for satellite imagery 'just needs a good ALiBi'.

pith-pipeline@v0.9.0 · 5397 in / 1195 out tokens · 31036 ms · 2026-05-10T15:35:12.447373+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders,

    A. Fuller, K. Millard, and J. R. Green, “CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (A. Oh, T. Naumann, A. Globerson, K. Saenk...

  2. [2]

    Copernicus Sentinel data, processed by ESA,

    European Space Agency, “Copernicus Sentinel data, processed by ESA,” 2024

  3. [3]

    National Agriculture Imagery Program (NAIP),

    U.S. Geological Survey, “National Agriculture Imagery Program (NAIP),” 2024

  4. [4]

    SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery,

    Y. Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y. He, M. Burke, D. B. Lobell, and S. Ermon, “SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery,” Jan. 2023

  5. [5]

    Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning,

    C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning,” Sept. 2023

  6. [6]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” https://arxiv.org/abs/2010.11929v2, Oct. 2020

  7. [7]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,

    O. Press, N. A. Smith, and M. Lewis, “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,” Apr. 2022

  8. [8]

    Representation Learning with Contrastive Predictive Coding,

    A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” Jan. 2019

  9. [9]

    Masked Autoencoders Are Scalable Vision Learners

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners.” https://arxiv.org/abs/2111.06377v3, Nov. 2021

  10. [10]

    GEO-Bench: Toward Foundation Models for Earth Monitoring,

    A. Lacoste, N. Lehmann, P. Rodriguez, E. D. Sherwin, H. Kerner, B. Lütjens, J. A. Irvin, D. Dao, H. Alemohammad, A. Drouin, M. Gunturkun, G. Huang, D. Vazquez, D. Newman, Y. Bengio, S. Ermon, and X. X. Zhu, “GEO-Bench: Toward Foundation Models for Earth Monitoring,” Dec. 2023

  11. [11]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,

    L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” Sept. 2020