OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics

Chenhao Sun

arxiv: 2605.30168 · v1 · pith:3JVDT3VYnew · submitted 2026-05-28 · 💻 cs.CV

Pith reviewed 2026-06-29 08:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords: change detection, remote sensing, multimodal semantics, foundational framework, image-text dataset, style disentanglement, zero-shot learning

The pith

OmniCD unifies remote sensing change detection by guiding it with multimodal image and text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OmniCD as a framework that brings together image and text prompts to handle change detection in remote sensing images. Traditional methods have trouble generalizing to different scenarios, so this approach uses semantic maps, textual descriptions, and geospatial metadata to support a range of tasks including binary change detection and zero-shot semantic understanding. It adds a hierarchical scene retrieval module, a change detection module, and a style disentanglement mechanism for better robustness across domains. A new dataset called RSITCD with more than 300,000 annotated image-text pairs is created to train and test the system. Experiments indicate it reaches state-of-the-art results on various benchmarks.

Core claim

OmniCD is a foundational framework that unifies remote sensing change detection through multimodal semantic guidance. It feeds prompts such as text descriptions and semantic maps into an architecture with hierarchical scene retrieval and style disentanglement, is backed by the RSITCD dataset of over 300K image-text pairs, and reports strong performance across tasks ranging from binary change detection to zero-shot semantic change understanding.

What carries the argument

The OmniCD architecture, which integrates multimodal prompts with a hierarchical scene retrieval module and a style disentanglement mechanism to enhance cross-domain robustness.

If this is right

OmniCD supports a spectrum of change detection tasks from binary to zero-shot semantic change understanding.
The style disentanglement mechanism aims to improve performance when applied to new domains.
The framework sets a foundation for building general-purpose change detection systems in remote sensing.
Extensive experiments position it as state-of-the-art on existing benchmarks.
RSITCD provides a large-scale resource with multimodal annotations for future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could allow users to query changes using natural language descriptions in addition to images.
It may extend to real-time applications in urban planning by incorporating live geospatial metadata.
Similar multimodal integration might apply to other image analysis tasks beyond change detection.
Further testing on diverse disaster scenarios could validate its adaptability claims.

Load-bearing premise

That adding multimodal prompts, hierarchical retrieval, and style disentanglement will produce better generalization and robustness than prior methods in varied remote sensing conditions.

What would settle it

A controlled test on a held-out remote sensing dataset from a completely different geographic region or sensor type where OmniCD does not outperform current leading methods.

Figures

Figures reproduced from arXiv: 2605.30168 by Chenhao Sun.

**Figure 1.** Framework diagram. Adjacent text from the paper (Section 3.1, Feature Extraction Module): This module consists of two components: an image encoder and a text encoder. For the image encoder, it can be any network that outputs a feature map of size C × H × W. To achieve scalability and fully leverage powerful pre-trained models, this work employs a Vision Transformer (ViT) [13] pre-trained with MAE [16], with minimal adaptive modifications to handle…

**Figure 2.** Guide Module. Adjacent text from the paper: Through these steps, the decoder effectively fuses prompt information with image embeddings and extracts masks for the regions of interest. In particular, during the cross-attention process, the image embeddings are treated as a set of 64 × 64 × 256 vectors. Each self-attention, cross-attention, and MLP operation is equipped with residual connections, layer normalization, and a dropout of 0.1 …

**Figure 3.** Detector Module. Adjacent text from the paper: First, the input bi-temporal images are encoded, and their features are subtracted channel-wise and converted to absolute values to obtain a change feature map. This step captures all the difference information between the two temporal images. Next, the change features are fed into the Pyramid Scene Parsing (PSP) segmentation module for further processing. The PSP module employs a pyramid …
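
The first step described in the Figure 3 excerpt (subtract the two encoded feature maps and take absolute values) can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the function name `change_feature_map`, the use of NumPy, and the 256 × 64 × 64 feature shape are hypothetical choices made here, and the PSP stage is left out because the excerpt is truncated before it is specified.

```python
import numpy as np


def change_feature_map(feat_t1, feat_t2):
    """Element-wise absolute difference of two (C, H, W) feature maps.

    Hypothetical helper: it mirrors the 'subtract, then take absolute
    values' step in the excerpt and nothing more.
    """
    if feat_t1.shape != feat_t2.shape:
        raise ValueError("bi-temporal feature maps must share a shape")
    return np.abs(feat_t1 - feat_t2)


# Toy bi-temporal features; the (256, 64, 64) shape is an assumption.
rng = np.random.default_rng(0)
feat_t1 = rng.standard_normal((256, 64, 64)).astype(np.float32)
feat_t2 = feat_t1.copy()
feat_t2[:, 10:20, 10:20] += 1.0  # simulate a changed region

change = change_feature_map(feat_t1, feat_t2)
```

Because of the absolute value, the result does not depend on which acquisition is treated as "before", and unchanged locations map to zero in every channel.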

**Figure 4.** Encoder Structure of the Style Detection Module [PITH_FULL_IMAGE:figures/full_fig_p010_4.png]

**Figure 5.** Decoder Structure of the Style Detection Module [PITH_FULL_IMAGE:figures/full_fig_p010_5.png]

**Figure 6.** Decoding Block of the Style Detection Module. Adjacent text from the paper (Section 3.5, Loss Function): To jointly optimize the change detection task and the image reconstruction task, this study introduces a multi-task loss function in which several loss components are combined and weighted by hyperparameters to balance their contributions. These components work together to guide the network toward learning both tasks in a cooperative manner. Th…

**Figure 7.** Illustration of Reference Image-Guided Method. Adjacent text from the paper: …fusion, thereby enhancing the model’s generalization, diversity, and robustness. However, existing change detection research lacks large-scale datasets suitable for training such models, and the available datasets vary significantly in organization, making them difficult to apply directly. To address this issue, we designed a unified data organization scheme s…

**Figure 8.** Sample Example from the RSITCD Dataset [PITH_FULL_IMAGE:figures/full_fig_p014_8.png]

**Figure 9.** Statistical Summary of the RSITCD Dataset [PITH_FULL_IMAGE:figures/full_fig_p015_9.png]

**Figure 10.** Sample images from the YRBCD dataset. Adjacent text from the paper (Metrics): Remote sensing image change detection is essentially a pixel-level binary classification task. To comprehensively assess model performance, this study adopts five standard evaluation metrics: Precision, Recall, F1-score, Intersection over Union (IoU), and Overall Accuracy (ACC). Precision measures how reliably the model identifies predicted change pixels, whe…
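
The five metrics named in the Figure 10 excerpt can be computed from the pixel-level confusion matrix of a binary change mask. The sketch below uses the standard textbook definitions; the excerpt is truncated before it gives formulas, so it is an assumption that the paper uses exactly these. The function name `change_detection_metrics` and the toy masks are hypothetical.

```python
import numpy as np


def change_detection_metrics(pred, target):
    """Precision, Recall, F1, IoU and overall accuracy for binary masks.

    Standard confusion-matrix definitions; 'change' is the positive class.
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    if pred.shape != target.shape:
        raise ValueError("prediction and target masks must share a shape")

    tp = int(np.sum(pred & target))    # change predicted, change present
    fp = int(np.sum(pred & ~target))   # change predicted, none present
    fn = int(np.sum(~pred & target))   # change missed
    tn = int(np.sum(~pred & ~target))  # no change, correctly ignored

    def safe_div(num, den):
        # Report 0.0 when a metric is undefined (empty denominator).
        return num / den if den > 0 else 0.0

    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "acc": safe_div(tp + tn, tp + fp + fn + tn),
    }


# Toy 2 x 4 masks: 2 true positives, 1 false positive, 1 false negative.
pred = np.array([[1, 1, 0, 0], [0, 1, 0, 0]])
target = np.array([[1, 0, 0, 0], [0, 1, 1, 0]])
metrics = change_detection_metrics(pred, target)
```

On the toy masks, precision, recall and F1 are all 2/3, IoU is 0.5 and overall accuracy is 0.75. The gap between accuracy and IoU shows why accuracy alone is a weak indicator when unchanged pixels dominate.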

**Figure 11.** Comparison Results on the LEVIR-CD Dataset [PITH_FULL_IMAGE:figures/full_fig_p019_11.png]

**Figure 12.** Comparison Results on the WHU-CD Dataset [PITH_FULL_IMAGE:figures/full_fig_p021_12.png]

**Figure 13.** Detection Results on the GVLM Dataset [PITH_FULL_IMAGE:figures/full_fig_p023_13.png]

**Figure 14.** Reference-image prompting [PITH_FULL_IMAGE:figures/full_fig_p024_14.png]

**Figure 15.** Detection Results on YRBCD with Reference Images [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] Adjacent text from the paper (Section 7, Conclusion): This work focuses on advancing remote sensing image change detection by integrating state-of-the-art multimodal vision–language techniques. We propose a …

Original abstract

Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OmniCD adds a multimodal prompt framework and a 300k-pair dataset to remote sensing change detection, but the abstract gives no numbers or ablations to support the SOTA and robustness claims.

The punchline is that OmniCD proposes a multimodal semantic guided framework for remote sensing change detection along with a new large-scale dataset, but the abstract does not supply the experimental details needed to back up the state-of-the-art claims.

The paper introduces OmniCD as a unified architecture that incorporates image and text prompts, including textual descriptions, semantic maps, and geospatial metadata. It supports a range of tasks from basic binary change detection to zero-shot semantic change understanding. The framework uses a hierarchical scene retrieval module and a change detection module, with a style disentanglement mechanism aimed at better cross-domain robustness. The paper also presents RSITCD, a dataset with more than 300,000 annotated image-text pairs.

This approach has some merit in trying to create a more general-purpose system for change detection in remote sensing, where generalization across different scenarios has been a known issue with traditional methods. Bringing in multimodal inputs could open up new ways to use additional context like text descriptions for better understanding of changes.

However, the soft spots are clear from the abstract. The claims of state-of-the-art performance and strong adaptability rest on assertion: there are no benchmark scores, no ablation studies on the new modules, and no cross-domain train/test results. That matters because, without them, it's impossible to know whether the hierarchical retrieval and style disentanglement are actually responsible for any improvements or whether other factors are at play. Soundness can't be judged either, since no method details, equations, or error bars are provided in what we have.

As for the citation pattern, the full text isn't available here, so it's hard to judge, but the idea of multimodal guidance seems to build on existing prompt-based methods in vision.

This paper would be of interest to researchers in computer vision for remote sensing applications, particularly those working on change detection for practical uses like urban monitoring or disaster assessment. A reader who wants to explore multimodal extensions or needs a large dataset might get some value from it, assuming the full paper includes the promised experiments and makes the dataset available.

Overall, it deserves a serious referee to assess whether the full experiments support the claims and to check the dataset's quality and annotations.

I would recommend engaging with the work by sending it to peer review rather than a desk reject, so the details can be properly evaluated.

Referee Report

2 major / 0 minor

Summary. The paper presents OmniCD as a foundational framework for remote sensing image change detection that incorporates multimodal semantic guidance through image and text prompts. It features a hierarchical scene retrieval module, a change detection module, and a style disentanglement mechanism to enhance cross-domain robustness. The authors introduce the RSITCD dataset containing over 300,000 annotated image-text pairs and report that OmniCD achieves state-of-the-art performance across multiple benchmarks, supporting tasks ranging from binary change detection to zero-shot semantic change understanding.

Significance. Should the claims be validated with rigorous experiments and ablations, this work has the potential to establish a new paradigm for general-purpose change detection systems in remote sensing by leveraging multimodal inputs for better generalization. The release of a large-scale multimodal dataset could also facilitate future research in the field.

major comments (2)

Abstract: the central claim that the hierarchical scene retrieval module combined with style disentanglement and multimodal prompts delivers improved cross-domain robustness and SOTA performance is presented without ablation results, cross-domain train/test splits, or quantitative comparisons isolating these components' contributions. This prevents attribution of any gains to the proposed modules rather than dataset scale or standard training.
Abstract: no methods, implementation details for the style disentanglement or retrieval modules, performance metrics, error bars, or experimental setup are provided, rendering the soundness of the SOTA and adaptability claims unevaluable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and the opportunity to address these points. We respond to each major comment below.

Referee: Abstract: the central claim that the hierarchical scene retrieval module combined with style disentanglement and multimodal prompts delivers improved cross-domain robustness and SOTA performance is presented without ablation results, cross-domain train/test splits, or quantitative comparisons isolating these components' contributions. This prevents attribution of any gains to the proposed modules rather than dataset scale or standard training.

Authors: We agree that the abstract, as a concise summary, does not itself contain the ablation results, cross-domain splits, or isolating comparisons. The full manuscript presents these analyses in the Experiments section to attribute gains to the proposed modules. We will revise the abstract to note that supporting quantitative evidence and ablations appear in the main text.

Revision: yes

Referee: Abstract: no methods, implementation details for the style disentanglement or retrieval modules, performance metrics, error bars, or experimental setup are provided, rendering the soundness of the SOTA and adaptability claims unevaluable.

Authors: We agree that the abstract omits these specifics, which is conventional for the format. The manuscript details the methods, implementation, metrics with error bars, and setup in the Methods and Experiments sections. We will revise the abstract to reference the experimental validation supporting the claims.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework claims rest on empirical results without self-referential derivations

full rationale

The paper introduces OmniCD as a multimodal framework for remote sensing change detection, describing components such as hierarchical scene retrieval, style disentanglement, and multimodal prompts, along with a new dataset RSITCD. No equations, parameter-fitting procedures, or derivation chains appear in the abstract or description. Performance claims are stated as outcomes of extensive experiments rather than predictions derived from fitted inputs or self-definitions. No self-citations are used to justify uniqueness theorems or ansatzes. The central assertions about cross-domain robustness are presented as design motivations supported by results, not as reductions equivalent to the inputs by construction. This satisfies the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or evaluated from the provided text.

Reference graph

Works this paper leans on

35 extracted references · 9 canonical work pages · 1 internal anchor

[1] Kussul, N., Lavreniuk, M., Skakun, S., et al.: Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters 14(5) (2017) 778–782
[2] Li, Z., Wang, Y., Zhang, N., et al.: Deep learning based object detection techniques for remote sensing images: A survey. Remote Sensing 14(10) (2022) 2385
[3] OpenAI: Gpt-4 technical report. arXiv preprint arXiv:submit/4812508 [cs.CL] (2023)
[4] Li, J., Li, D., Xiong, C., et al.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086v2 [cs.CV] (2022)
[5] Alexander, K., Eric, M., Nikhila, R., et al.: Segment anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE (2023) 3992–4003
[6] Deng, J., Wang, K., Deng, Y., et al.: Pca-based land-use change detection and analysis using multitemporal and multisensor satellite data. International Journal of Remote Sensing 29(16) (2008) 4823–4838
[7] Marpu, P., Gamba, P., Canty, M.: Improving change detection results of ir-mad by eliminating strong changes. IEEE Geoscience and Remote Sensing Letters 8(4) (2011) 799–803
[8] Chen, G., Hay, G., Carvalho, L., et al.: Object-based change detection. International Journal of Remote Sensing 33(14) (2012) 4434–4457
[9] Walter, V.: Object-based classification of remote sensing data for change detection. ISPRS Journal of Photogrammetry and Remote Sensing 58(3-4) (2004) 225–238
[10] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Springer (2015) 234–241
[11] Chen, H., Shi, Z.: A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing 12(10) (2020) 1662
[12] Fang, S., Li, K., Shao, J., et al.: Snunet-cd: A densely connected siamese network for change detection of vhr images. IEEE Geoscience and Remote Sensing Letters 19 (2021) 1–5
[13] Kolesnikov, V., Dosovitskiy, A., Weissenborn, D., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: The Thirteenth International Conference on Learning Representations (ICLR). (2021)
[14] Chen, H., Qi, Z., Shi, Z.: Remote sensing image change detection with transformers. IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–14
[15] Liu, M., Chai, Z., Deng, H., et al.: A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15 (2022) 4297–4306
[16] He, K., Chen, X., Xie, S., et al.: Masked autoencoders are scalable vision learners. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2022) 15979–15988
[17] Mañas, O., Lacoste, A., Giró-i Nieto, X., et al.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE (2021) 9394–9403
[18] Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2023) 5161–5270
[19] Hong, D., Zhang, B., Li, X., et al.: Spectralgpt: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8) (2024) 5227–5244
[20] Guo, X., Lao, J., Dang, B., et al.: Skysense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2024) 27662–27673
[21] Hu, Y., Yuan, J., Wen, C., et al.: Rsgpt: A remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266 [cs.CV] (2023)
[22] Liu, F., Chen, D., Guan, Z., et al.: Remoteclip: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing 62 (2024) 1–16
[23] Kuckreja, K., Danish, M., Naseer, M., et al.: Geochat: Grounded large vision-language model for remote sensing. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (2024) 27831–27840
[24] Mall, U., Phoo, C., Liu, M., et al.: Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv preprint arXiv:2312.06960 [cs.CV] (2023)
[25] Cepeda, V., Nayak, G., Shah, M.: Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 [cs.CV] (2023)
[26] Klemmer, K., Rolf, E., Robinson, C., et al.: Satclip: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 [cs.CV] (2023)
[27] Khanna, S., Liu, P., Zhou, L., et al.: Diffusionsat: A generative foundation model for satellite imagery. arXiv preprint arXiv:2312.03606 [cs.CV] (2024)
[28] Devlin, J., Chang, M., Lee, K., et al.: Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[29] Wang, X., Chen, H., Tang, S., et al.: Disentangled representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12) (2024) 9677–9696
[30] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 1510–1519
[31] Zhang, R., Jiang, Z., Guo, Z., et al.: Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048 [cs.CV] (2023)
[32] Daudt, R., Le Saux, B., Boulch, A.: Fully convolutional siamese networks for change detection. In: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE (2018) 4063–4067
[33] Tan, X., Chen, G., Wang, T., et al.: Segment change model (scm) for unsupervised change detection in vhr remote sensing images: A case study of buildings. In: IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, IEEE (2024) 8577–8580
[34] Zheng, Z., Zhong, Y., Zhang, L., et al.: Segment any change. In: 38th Conference on Neural Information Processing Systems (NeurIPS). (2024)
[35] Huang, X., Zhu, T., Zhang, L., et al.: A novel building change index for automatic building change detection from high-resolution remote sensing imagery. Remote Sensing Letters 5(8) (2014) 713–722

[1] [1]

IEEE Geoscience and Remote Sensing Letters14(5) (2017) 778–782

Kussul, N., Lavreniuk, M., Skakun, S., et al.: Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters14(5) (2017) 778–782

2017

[2] [2]

Remote Sensing14(10) (2022) 2385

Li, Z., Wang, Y., Zhang, N., et al.: Deep learning based object detection techniques for remote sensing images: A survey. Remote Sensing14(10) (2022) 2385

2022

[3] [3]

arXiv preprint arXiv:submit/4812508 [cs.CL] (2023)

OpenAI: Gpt-4 technical report. arXiv preprint arXiv:submit/4812508 [cs.CL] (2023)

work page arXiv 2023

[4] [4]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.https://arxiv.org/abs/2201.12086, 2022

Li, J., Li, D., Xiong, C., et al.: Blip: Bootstrapping language-image pre- training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086v2 [cs.CV] (2022)

work page arXiv 2022

[5] [5]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE (2023) 3992–4003

Alexander, K., Eric, M., Nikhila, R., et al.: Segment anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE (2023) 3992–4003

2023

[6] [6]

International Journal of Remote Sensing29(16) (2008) 4823–4838

Deng, J., Wang, K., Deng, Y., et al.: Pca-based land-use change detection and analysis using multitemporal and multisensor satellite data. International Journal of Remote Sensing29(16) (2008) 4823–4838

2008

[7] [7]

IEEE Geoscience and Remote Sensing Letters8(4) (2011) 799–803

Marpu, P., Gamba, P., Canty, M.: Improving change detection results of ir-mad by eliminating strong changes. IEEE Geoscience and Remote Sensing Letters8(4) (2011) 799–803

2011

[8] [8]

Interna- tional Journal of Remote Sensing33(14) (2012) 4434–4457

Chen, G., Hay, G., Carvalho, L., et al.: Object-based change detection. Interna- tional Journal of Remote Sensing33(14) (2012) 4434–4457

2012

[9] [9]

ISPRS Journal of Photogrammetry and Remote Sensing58(3-4) (2004) 225–238

Walter, V.: Object-based classification of remote sensing data for change detection. ISPRS Journal of Photogrammetry and Remote Sensing58(3-4) (2004) 225–238

2004

[10] [10]

In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Springer (2015) 234–241

Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed- ical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Springer (2015) 234–241

2015

[11] [11]

Remote Sensing12(10) (2020) 1662

Chen, H., Shi, Z.: A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing12(10) (2020) 1662

2020

[12] [12]

IEEE Geoscience and Remote Sensing Letters 19(2021) 1–5

Fang, S., Li, K., Shao, J., et al.: Snunet-cd: A densely connected siamese network for change detection of vhr images. IEEE Geoscience and Remote Sensing Letters 19(2021) 1–5

2021

[13] [13]

In: The Thirteenth Interna- tional Conference on Learning Representations (ICLR)

Kolesnikov, V., Dosovitskiy, A., Weissenborn, D., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: The Thirteenth Interna- tional Conference on Learning Representations (ICLR). (2021)

2021

[14] [14]

IEEE Transactions on Geoscience and Remote Sensing60(2022) 1–14

Chen,H.,Qi,Z.,Shi,Z.: Remotesensingimagechangedetectionwithtransformers. IEEE Transactions on Geoscience and Remote Sensing60(2022) 1–14

2022

[15] [15]

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing15(2022) 4297–4306

Liu, M., Chai, Z., Deng, H., et al.: A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing15(2022) 4297–4306

2022

[16] [16]

In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2022) 15979–15988

He, K., Chen, X., Xie, S., et al.: Masked autoencoders are scalable vision learners. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2022) 15979–15988

2022

[17] [17]

In: 2021 IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), IEEE (2021) 9394–9403

Mañas, O., Lacoste, A., Giró-i Nieto, X., et al.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: 2021 IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), IEEE (2021) 9394–9403

2021

[18] [18]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2023) 5161–5270 ECCV-16 submission ID *** 29

Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2023) 5161–5270 ECCV-16 submission ID *** 29

2023

[19] [19]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(8) (2024) 5227–5244

Hong, D., Zhang, B., Li, X., et al.: Spectralgpt: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence46(8) (2024) 5227–5244

2024

[20] [20]

In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2024) 27662–27673

Guo, X., Lao, J., Dang, B., et al.: Skysense: A multi-modal remote sensing founda- tion model towards universal interpretation for earth observation imagery. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2024) 27662–27673

2024

[21] Hu, Y., Yuan, J., Wen, C., et al.: Rsgpt: A remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266 [cs.CV] (2023)

[22] Liu, F., Chen, D., Guan, Z., et al.: Remoteclip: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing 62 (2024) 1–16

[23] Kuckreja, K., Danish, M., Naseer, M., et al.: Geochat: Grounded large vision-language model for remote sensing. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (2024) 27831–27840

[24] Mall, U., Phoo, C., Liu, M., et al.: Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv preprint arXiv:2312.06960 [cs.CV] (2023)

[25] Cepeda, V., Nayak, G., Shah, M.: Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 [cs.CV] (2023)

[26] Klemmer, K., Rolf, E., Robinson, C., et al.: Satclip: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 [cs.CV] (2023)

[27] Khanna, S., Liu, P., Zhou, L., et al.: Diffusionsat: A generative foundation model for satellite imagery. arXiv preprint arXiv:2312.03606 [cs.CV] (2024)

[28] Devlin, J., Chang, M., Lee, K., et al.: Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

[29] Wang, X., Chen, H., Tang, S., et al.: Disentangled representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12) (2024) 9677–9696

[30] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 1510–1519

[31] Zhang, R., Jiang, Z., Guo, Z., et al.: Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048 [cs.CV] (2023)

[32] Daudt, R., Le Saux, B., Boulch, A.: Fully convolutional siamese networks for change detection. In: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE (2018) 4063–4067

[33] Tan, X., Chen, G., Wang, T., et al.: Segment change model (scm) for unsupervised change detection in vhr remote sensing images: A case study of buildings. In: IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, IEEE (2024) 8577–8580

[34] Zheng, Z., Zhong, Y., Zhang, L., et al.: Segment any change. In: 38th Conference on Neural Information Processing Systems (NeurIPS). (2024)

[35] Huang, X., Zhu, T., Zhang, L., et al.: A novel building change index for automatic building change detection from high-resolution remote sensing imagery. Remote Sensing Letters 5(8) (2014) 713–722