
arxiv: 2606.20523 · v2 · pith:PCPT5WBInew · submitted 2026-06-18 · 💻 cs.CV · cs.AI · cs.DB

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Pith reviewed 2026-06-30 10:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.DB
keywords SAR-optical dataset · very-high-resolution · slant-range · multimodal dataset · remote sensing · image captioning · cross-modal retrieval

The pith

SARLO-80 supplies 119,566 worldwide VHR SAR patches aligned with optical imagery and text captions in slant-range geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents SARLO-80, a new dataset of more than 119,000 triplets, each pairing a very-high-resolution SAR image in slant range with a corresponding optical image and natural-language descriptions. The motivation is that multimodal models have advanced on optical data but lack equivalent SAR resources that preserve complex values and native acquisition geometry. The dataset is built by standardizing Umbra spotlight acquisitions to an 80 cm grid, warping optical tiles onto that grid using local coordinate correspondences, and generating short, medium, and long captions for each sample. It spans 257 locations in 72 countries and comes with fixed splits and code for retrieval and generation benchmarks. Readers would care because it makes it possible to train models on SAR data in its original complex, slant-range form rather than on intensity-only ground-range products.

Core claim

The paper establishes a public VHR SAR-optical-text dataset of 119,566 triplets drawn from around 2,500 Umbra scenes: 80 cm slant-range SAR patches, each paired with a warped optical patch and captions in three lengths. The release is meant to enable reproducible multimodal alignment benchmarks in native SAR geometry across diverse global sites.

What carries the argument

The SARLO-80 dataset itself: complex and amplitude 1024 × 1024 slant-range SAR patches, each with a locally warped optical counterpart and natural-language descriptions in three variants.
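As a rough illustration of what "complex and amplitude 1024 × 1024 patches" means in practice, the sketch below tiles a complex scene into non-overlapping patches and derives the amplitude channel as the magnitude of the complex samples. The function names are hypothetical, and dropping partial edge tiles is an assumption; the paper summary does not say how remainders or overlaps are handled.

```python
import numpy as np

PATCH = 1024  # patch side in pixels, as stated in the paper


def tile_patches(slc: np.ndarray, patch: int = PATCH):
    """Yield (row, col, complex_patch) for non-overlapping full tiles.

    Assumption: edge remainders smaller than `patch` are dropped.
    """
    rows, cols = slc.shape
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            yield r, c, slc[r:r + patch, c:c + patch]


def amplitude(complex_patch: np.ndarray) -> np.ndarray:
    """Amplitude channel: magnitude of the complex samples."""
    return np.abs(complex_patch)
```

Under these assumptions a 2500 × 3000 scene would yield four full patches, and the phase that the amplitude channel discards remains available in the complex patch.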

If this is right

  • Cross-modal retrieval models can now be trained and evaluated on aligned VHR SAR and optical data in slant geometry.
  • Conditional generation from text to SAR or optical to SAR becomes feasible with public baselines.
  • Fixed train, validation, and test splits support consistent comparison of multimodal methods worldwide.
  • The released preprocessing pipeline allows replication and extension to additional SAR scenes.
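To make the retrieval bullet concrete, here is a minimal sketch of Recall@K for paired embeddings, a common metric for cross-modal retrieval. It is not taken from the paper's baseline code: the function name, the use of cosine similarity, and the convention that row i of the query set is paired with row i of the gallery are all assumptions.

```python
import numpy as np


def recall_at_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int) -> float:
    """Fraction of queries whose paired item (same row index) ranks in the top k."""
    # L2-normalize so the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # shape (n_queries, n_gallery)
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar items
    truth = np.arange(len(q))[:, None]  # assumed pairing: query i <-> gallery i
    return float((topk == truth).any(axis=1).mean())
```

With SAR embeddings as queries and optical (or caption) embeddings as the gallery, the same function would cover SAR-to-optical and SAR-to-text retrieval on the fixed test split.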

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Working in native slant range could show what information SAR retains that is lost in ground-range products.
  • Text descriptions might support zero-shot tasks in remote sensing applications where visual data alone is insufficient.
  • Global coverage across land types could aid in developing models robust to varied environments and infrastructures.

Load-bearing premise

The local coordinate correspondences used to warp optical tiles onto the SAR grid produce accurate pixel-level alignment without significant residual geometric error or occlusion mismatch.

What would settle it

Finding a large number of samples where the warped optical patch shows visible misalignment with the SAR features, such as shifted building edges or road positions, would indicate the alignment step failed.
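One way to screen for such shifts automatically is phase correlation, sketched below. This is an illustrative check, not the authors' validation method. Raw SAR amplitude and optical intensity rarely correlate directly, so the sketch assumes both inputs have first been converted to comparable feature maps (for example edge maps); that preprocessing is not shown, and the function name is hypothetical.

```python
import numpy as np


def estimate_shift(reference: np.ndarray, moving: np.ndarray):
    """Integer (row, col) shift that realigns `moving` with `reference`.

    Applying the result with np.roll(moving, shift, axis=(0, 1)) undoes a
    circular translation between the two equally sized inputs.
    """
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)
    cross = f_ref * np.conj(f_mov)
    cross /= np.maximum(np.abs(cross), 1e-12)  # keep phase only
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks in the upper half of each axis correspond to negative shifts.
    return tuple(int(p) if p <= n // 2 else int(p) - n
                 for p, n in zip(peak, corr.shape))
```

A histogram of the estimated shifts over many patches, with a large share far from (0, 0), would be the kind of evidence described above. Layover and occlusion mismatches are local rather than global, so a single translation per patch would only catch part of the problem.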

Figures

Figures reproduced from arXiv: 2606.20523 by Elise Colin, Georgia Channing, Nathan Letheule, Nicolas Trouvé, Solène Debuysère.

Figure 1: Overview of dataset coverage and SAR acquisition geometry.
Figure 2: Overview of our final dataset (Top-N + Other): distribution of images by country (left), and …
Figure 3: Detailed overview of the dataset creation pipeline. Steps …
Figure 4: Example of SAR–optical–Caption from our Dataset
Figure 5: Prompt labeling overview (left) and summary prompt statistics (right).
Figure 6: Comparison of generated SAR images across models and real data (1024 …
Figure 7: Variants of real SAR SLC images generated with the fine-tuned SDXL backbone. The two …
Figure 7 or 8 (labelled both ways in the source): Examples of selected SAR–optical pairs from our dataset. (a) A satellite image of a coastal region …
Figure 8: Additional examples of SAR–optical pairs from our dataset. (a) A satellite image of an expansive area …
Original abstract

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR–optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR–optical–text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm–2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision–language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.
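The abstract names band-limited FFT resampling without giving details. A minimal sketch of the general technique follows: take the 2-D spectrum, crop it (downsampling) or zero-pad it (upsampling) around zero frequency, and invert. This is a generic Fourier resampler, not the authors' released code. It assumes the spectrum is centred on zero frequency and that the quoted resolutions can be treated as sample spacings; a real SAR pipeline would also have to handle spectral offsets and windowing, which are omitted here. The function and parameter names are hypothetical.

```python
import numpy as np


def fft_resample(slc: np.ndarray, native_spacing_m: float,
                 target_spacing_m: float = 0.80) -> np.ndarray:
    """Resample a complex 2-D image to a new sample spacing via its spectrum."""
    rows, cols = slc.shape
    scale = native_spacing_m / target_spacing_m  # output samples per input sample
    new_rows = max(1, round(rows * scale))
    new_cols = max(1, round(cols * scale))

    spectrum = np.fft.fftshift(np.fft.fft2(slc))  # zero frequency at the centre
    out = np.zeros((new_rows, new_cols), dtype=spectrum.dtype)

    # Copy the central block common to both spectra: this crops high
    # frequencies when shrinking and zero-pads them when enlarging.
    kr, kc = min(rows, new_rows), min(cols, new_cols)
    src_r, src_c = rows // 2 - kr // 2, cols // 2 - kc // 2
    dst_r, dst_c = new_rows // 2 - kr // 2, new_cols // 2 - kc // 2
    out[dst_r:dst_r + kr, dst_c:dst_c + kc] = spectrum[src_r:src_r + kr,
                                                       src_c:src_c + kc]

    resampled = np.fft.ifft2(np.fft.ifftshift(out))
    # Rescale so that sample values (not total energy) are preserved.
    return resampled * (new_rows * new_cols) / (rows * cols)
```

Cropping the spectrum acts as an ideal low-pass filter, which is what keeps the downsampled image free of aliasing. Zero-padding interpolates between samples without adding information, so an upsampled scene has an 80 cm grid but not 80 cm detail.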

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents SARLO-80, a dataset of 119,566 triplets (complex and amplitude 80 cm slant-range SAR patches, aligned optical patches, and natural-language captions in SHORT/MID/LONG variants) derived from ~2,500 Umbra spotlight SICD scenes. The pipeline standardizes SAR data to an 80 cm slant-range grid via band-limited FFT resampling, tiles into 1024×1024 patches, warps optical tiles onto the SAR grid using local coordinate correspondences, generates captions, and releases fixed train/validation/test splits plus preprocessing code on the Hugging Face Hub. Coverage spans 257 locations across 72 countries.

Significance. If the claimed alignments are accurate, the release would fill a documented gap by providing the first large-scale public VHR SAR-optical-text resource that retains complex-valued SLC data and native slant-range geometry, supporting reproducible cross-modal retrieval and conditional generation benchmarks.

major comments (1)
  1. [Abstract and data-construction section] Abstract (paragraph on optical warping) and corresponding methods description: the claim of 'local pixel-level alignment' via local coordinate correspondences is presented without any quantitative validation (RMSE, residual error statistics, terrain-correction details, or assessment of layover/foreshortening mismatches). This is load-bearing for the central claim that the triplets support multimodal learning in native SAR geometry.
minor comments (1)
  1. [Abstract] Abstract: the native resolution range (20 cm–2 m) is stated but the resampling target is fixed at 80 cm; clarify whether all input scenes are up- or down-sampled and any resulting artifacts.
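The arithmetic behind the minor comment is simple enough to write down. The sketch below computes the per-axis resampling factor implied by the figures in the abstract, under the assumption that the quoted native resolution can be read as a sample spacing (in SAR products the two generally differ). The helper name is hypothetical.

```python
TARGET_M = 0.80  # target grid spacing from the abstract


def resampling_factor(native_m: float, target_m: float = TARGET_M) -> float:
    """Output samples per input sample along one axis; below 1 means downsampling."""
    return native_m / target_m


for native in (0.20, 0.80, 2.00):
    factor = resampling_factor(native)
    mode = "downsample" if factor < 1 else "upsample" if factor > 1 else "unchanged"
    print(f"{native:.2f} m -> {factor:.2f}x ({mode})")
# prints:
# 0.20 m -> 0.25x (downsample)
# 0.80 m -> 1.00x (unchanged)
# 2.00 m -> 2.50x (upsample)
```

Read this way, the stated 20 cm–2 m range implies that both directions occur: the finest scenes lose detail, and the coarsest are interpolated by a factor of 2.5.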

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive comment on the alignment validation. We address the point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and data-construction section] Abstract (paragraph on optical warping) and corresponding methods description: the claim of 'local pixel-level alignment' via local coordinate correspondences is presented without any quantitative validation (RMSE, residual error statistics, terrain-correction details, or assessment of layover/foreshortening mismatches). This is load-bearing for the central claim that the triplets support multimodal learning in native SAR geometry.

    Authors: We agree that the absence of quantitative validation metrics is a limitation. The alignment is performed by warping optical tiles onto the SAR slant-range grid using local coordinate correspondences extracted from the SICD metadata; this produces a deterministic pixel-to-pixel mapping by construction. However, the current manuscript provides no RMSE, residual statistics, explicit terrain-correction description, or analysis of layover/foreshortening effects. In the revised version we will expand the methods section with (i) a precise description of the coordinate-based warping pipeline, (ii) any available geolocation accuracy figures from the Umbra SICD products, and (iii) an explicit discussion of residual geometric mismatches that can arise in native slant-range geometry. If a modest validation subset can be generated from the released data without new acquisitions, we will also report basic alignment quality statistics.

    Revision: yes
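The "basic alignment quality statistics" mentioned in the rebuttal could be as simple as an RMSE over matched control points. The sketch below assumes a hypothetical set of manually or automatically matched (row, col) points in the SAR patch and the warped optical patch; no such points are part of the release as described. Converting pixels to metres with a single 0.80 m spacing is also an assumption, since it treats both image axes as having that spacing.

```python
import math


def alignment_rmse(sar_points, optical_points, pixel_spacing_m=0.80):
    """RMSE of control-point residuals, returned as (pixels, metres).

    Each argument is a sequence of (row, col) positions; entry i in one list
    is assumed to match entry i in the other.
    """
    if len(sar_points) != len(optical_points) or not sar_points:
        raise ValueError("need two non-empty point lists of equal length")
    squared = [
        (rs - ro) ** 2 + (cs - co) ** 2
        for (rs, cs), (ro, co) in zip(sar_points, optical_points)
    ]
    rmse_px = math.sqrt(sum(squared) / len(squared))
    return rmse_px, rmse_px * pixel_spacing_m
```

Reporting this figure separately for flat terrain and for built-up or mountainous scenes would speak to the layover and foreshortening concern, since those effects inflate residuals only where there is relief.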

Circularity Check

0 steps flagged

Dataset release paper contains no derivation chain or self-referential predictions

Full rationale

The paper is a data-release contribution describing collection, standardization, and alignment of SAR-optical-text triplets from open Umbra SICD scenes and optical tiles. No equations, fitted parameters, predictions, or uniqueness theorems appear. The central claim (existence of 119,566 aligned triplets) is externally verifiable by inspecting the released Hugging Face dataset and does not reduce to any self-citation or input-by-construction step. Alignment is presented as a construction step, not a derived result.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Dataset curation paper; no new physical quantities or mathematical derivations are introduced. The main added value is the collection, standardization, and public release of existing open-access imagery.

free parameters (1)
  • target slant-range resolution = 80 cm
    Chosen standardization value of 80 cm to unify scenes whose native resolutions range from 20 cm to 2 m.


