UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Di Wang; Haonan Guo; He Chen; Jing Zhang; Ning Zhang; Shuo Ni

arxiv: 2511.23332 · v3 · submitted 2025-11-28 · 💻 cs.CV

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni , Di Wang , He Chen , Haonan Guo , Ning Zhang , Jing Zhang This is my paper

Pith reviewed 2026-05-17 04:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords remote sensinginstruction-driven segmentationopen-world segmentationgeospatial sceneslarge-scale datasetunified frameworkzero-shot generalizationmulti-task learning

0 comments

The pith

UniGeoSeg unifies referring, interactive, and reasoning segmentation for remote sensing images through a new million-scale dataset and multi-task training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified approach to instruction-driven segmentation in geospatial scenes by releasing GeoSeg-1M, a dataset of 590K images and 1.1M triplets built automatically from existing public sources, along with the UniGeoSeg model. The model adds task-aware text enhancement, latent knowledge memory, and progressive training to support simultaneous learning across fragmented task types. A sympathetic reader would care because prior methods remain limited by small, task-specific datasets and poor generalization to complex real-world aerial or satellite imagery. If the claims hold, this would allow a single system to handle diverse instructions without separate models or retraining.

Core claim

We introduce GeoSeg-1M containing 1.1M image-mask-instruction triplets across 117 categories and curate GeoSeg-Bench to test contextual understanding, then present UniGeoSeg as a baseline framework that achieves state-of-the-art results on the new benchmark and public datasets while showing strong zero-shot generalization through its task-aware components and training strategy.

What carries the argument

The UniGeoSeg framework that incorporates task-aware text enhancement, latent knowledge memory, and progressive training to enable multi-task learning over referring, interactive, and reasoning segmentation instructions.

If this is right

UniGeoSeg reaches state-of-the-art performance on GeoSeg-Bench and multiple public remote-sensing benchmarks.
The same model handles referring, interactive, and reasoning segmentation within one training run.
Strong zero-shot transfer occurs to unseen instructions and complex geospatial scenes.
Multi-task learning improves contextual understanding compared with task-specific models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The automatic data-synthesis approach could be reused to scale instruction data in other vision domains that currently lack large paired instruction sets.
Unified models of this type might reduce the need for domain experts to maintain separate segmentation tools for different geospatial applications.
If the memory and enhancement modules prove reusable, they could serve as drop-in components for open-world segmentation outside remote sensing.

Load-bearing premise

The automatic mask filtering and instruction generation pipeline creates high-quality, unbiased triplets that support effective multi-task learning and real-world generalization.

What would settle it

Run UniGeoSeg on a fresh collection of geospatial images paired with human-written instructions that were never seen during the automatic pipeline and measure whether zero-shot accuracy falls substantially below the reported levels on GeoSeg-Bench.

Figures

Figures reproduced from arXiv: 2511.23332 by Di Wang, Haonan Guo, He Chen, Jing Zhang, Ning Zhang, Shuo Ni.

**Figure 2.** Figure 2: The diagram of UniGeoSeg. The top indicates the whole pipeline, and the bottom describes each module. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative examples of the segmentations generated by UniGeoSeg and comparative methods on GeoSeg-Bench. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: The three mask marking strategies we tried in model-based mask filtering. (a) Boundary-only highlight. (b) Semi-transparent [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: The prompt of InternVL3 for mask filtering. The images marked with [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: The prompt of GPT-4o for attribute reasoning instruction generation. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: The prompt of GPT-4o for the first step of context reasoning instruction generation. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: The prompt of GPT-4o for the second step of context reasoning instruction generation. [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: The prompt of InternVL3 and QwenVL2 for evaluating the quality of reasoning image–mask–instruction triplets. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: The prompt of GPT-4o for referring instruction generation. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: The prompt of InternVL3 and QwenVL2 for evaluating the quality of referring image–mask–instruction triplets. [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Class distribution in GeoSeg-1M. (a) Overall class distribution across the entire dataset. (b–d) Class distributions for each [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Additional samples of GeoSeg-1M. 16 [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Additional samples of GeoSeg-Bench. 17 [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: Additional samples of model prediction by UniGeoSeg. [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗

read the original abstract

Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper ships a million-scale dataset for instruction-driven remote sensing segmentation plus a unified baseline, but the automatic data pipeline lacks the quality checks needed to fully trust the SOTA and zero-shot claims.

read the letter

The main takeaway is a new million-scale resource for instruction-driven segmentation on geospatial images, built by pulling from existing public datasets and generating referring, interactive, and reasoning instructions automatically. They also release GeoSeg-Bench and a model called UniGeoSeg that adds task-aware text enhancement, a latent knowledge memory module, and progressive training to handle the tasks together. The work reports strong results across their benchmark and other public sets, with some zero-shot generalization shown in the abstract. Releasing the data and code at the GitHub link is the practical part that stands out here, since earlier efforts in this space stayed smaller and more split across task types. This could help people working on flexible analysis for earth observation, like spotting changes in land use or assessing damage after events. The soft spot sits in the data synthesis step. The pipeline filters masks and creates instructions without any reported human agreement scores, error breakdowns, or ablations on how the generation choices affect downstream performance. If mismatches between text and actual mask content slipped through, or if source dataset noise carried over, the performance edge could partly reflect data artifacts rather than the model itself. The abstract claims extensive experiments, but the lack of those checks makes it harder to pin down what is driving the numbers. This is aimed at computer vision and remote sensing groups that need large instruction-following data for segmentation. A reader who wants to test unified models on real geospatial scenes or build on the released triplets would get direct value. It should go to peer review because the dataset scale and the unification approach are concrete steps worth referee scrutiny, even with the current gaps in validation details.

Referee Report

2 major / 2 minor

Summary. The paper presents GeoSeg-1M, a 1.1M-triplet dataset for instruction-driven segmentation in remote sensing synthesized via an automatic mask-filtering and instruction-generation pipeline from existing public sources; it also introduces the GeoSeg-Bench evaluation suite and the UniGeoSeg model that combines task-aware text enhancement, latent knowledge memory, and progressive multi-task training. The central claims are that UniGeoSeg achieves state-of-the-art results on GeoSeg-Bench and diverse public benchmarks while demonstrating strong zero-shot generalization.

Significance. If the data-synthesis quality and experimental claims are substantiated, the work would supply the first large-scale resource for unified referring, interactive, and reasoning segmentation in geospatial imagery and a practical baseline architecture, thereby enabling more systematic study of multi-task and open-world capabilities in remote-sensing vision.

major comments (2)

[Section 3] Section 3 (GeoSeg-1M Construction): the automatic mask filtering and instruction generation pipeline is load-bearing for all downstream claims, yet the manuscript reports neither human agreement rates on generated triplets, quantitative error analysis of semantic mismatches between text and masks, nor ablations on pipeline variants. Without these, performance gains on GeoSeg-Bench cannot be confidently attributed to UniGeoSeg rather than residual label noise or synthesis artifacts.
[Section 5] Section 5 (Experiments): the SOTA and zero-shot results are presented without the full set of ablation studies on the individual components (task-aware text enhancement, latent knowledge memory, progressive training) or detailed verification of the zero-shot evaluation protocol, making it difficult to isolate the contribution of each design choice to the reported gains.

minor comments (2)

[Figure 2] Figure 2 and the accompanying text use inconsistent terminology for the three instruction types (referring vs. interactive vs. reasoning); a single canonical naming should be adopted throughout.
[Related Work] The manuscript does not cite prior large-scale remote-sensing segmentation datasets (e.g., LoveDA, iSAID) when discussing the construction of GeoSeg-1M; adding these references would clarify the novelty of the synthesis approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment in detail below and have revised the paper accordingly to enhance the validation of our dataset construction and experimental analysis.

read point-by-point responses

Referee: [Section 3] Section 3 (GeoSeg-1M Construction): the automatic mask filtering and instruction generation pipeline is load-bearing for all downstream claims, yet the manuscript reports neither human agreement rates on generated triplets, quantitative error analysis of semantic mismatches between text and masks, nor ablations on pipeline variants. Without these, performance gains on GeoSeg-Bench cannot be confidently attributed to UniGeoSeg rather than residual label noise or synthesis artifacts.

Authors: We fully agree that rigorous validation of the data synthesis pipeline is essential to substantiate our claims. Although the original manuscript focused on describing the scalable automatic pipeline, we recognize the need for human validation and error analysis. In the revised version, we have incorporated a human study involving expert annotators evaluating a sample of generated triplets for instruction-mask alignment, reporting high agreement rates. We also add a quantitative breakdown of potential semantic mismatches and ablations comparing different pipeline configurations (e.g., with and without specific filtering steps). These additions are placed in Section 3 and the supplementary material, allowing readers to better assess the dataset quality and attribute improvements to the proposed UniGeoSeg model. revision: yes
Referee: [Section 5] Section 5 (Experiments): the SOTA and zero-shot results are presented without the full set of ablation studies on the individual components (task-aware text enhancement, latent knowledge memory, progressive training) or detailed verification of the zero-shot evaluation protocol, making it difficult to isolate the contribution of each design choice to the reported gains.

Authors: We thank the referee for this suggestion to strengthen the experimental section. The original manuscript included some component analysis, but we agree that a more comprehensive set of ablations is warranted. In the revision, we have added detailed ablation studies for each proposed component—task-aware text enhancement, latent knowledge memory, and progressive multi-task training—showing their individual and combined impacts on performance. We have also expanded the description of the zero-shot evaluation protocol, including specifics on category and scene selection to confirm no overlap with training data. These updates are integrated into Section 5, with additional results in the appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmark evaluation are self-contained

full rationale

The paper describes an automatic pipeline to synthesize GeoSeg-1M triplets from existing public datasets, curates GeoSeg-Bench from the same sources, and reports experimental SOTA results for UniGeoSeg on both the new benchmark and external public datasets. No equations, fitted parameters, or first-principles derivations are present that reduce by construction to the inputs. Claims rest on empirical performance rather than self-definitional renaming, fitted-input predictions, or load-bearing self-citations. The central results therefore remain independent of the circularity patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims depend on the assumption that automatically synthesized instructions from existing datasets are representative and high-quality enough for training a generalizable model; no new physical entities or mathematical derivations are introduced.

axioms (1)

domain assumption Large-scale synthetic instruction data generated from public datasets can train models that generalize to real geospatial scenes
Invoked in the description of the automatic mask filtering and instruction generation pipeline.

pith-pipeline@v0.9.0 · 5522 in / 1248 out tokens · 35633 ms · 2026-05-17T04:34:03.227751+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
cs.CV 2026-05 accept novelty 7.0

AgroVG is a new multi-source benchmark for agricultural visual grounding formulated as generalized set prediction, with protocols for box and mask grounding across single-target, multi-target, and target-absent querie...
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
cs.CV 2026-05 unverdicted novelty 7.0

VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
cs.CV 2026-04 unverdicted novelty 6.0

A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 3 Pith papers · 8 internal anchors

[1]

Generalizable disaster damage assessment via change detection with vision foun- dation model

Kyeongjin Ahn, Sungwon Han, Sungwon Park, Jihee Kim, Sangyoon Park, and Meeyoung Cha. Generalizable disaster damage assessment via change detection with vision foun- dation model. InProceedings of the AAAI Conference on Artificial Intelligence, pages 27784–27792, 2025. 1

work page 2025
[2]

Skyscapes fine-grained se- mantic understanding of aerial scenes

Seyed Majid Azimi, Corentin Henry, Lars Sommer, Arne Schumann, and Eleonora Vig. Skyscapes fine-grained se- mantic understanding of aerial scenes. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 7393–7403, 2019. 3

work page 2019
[3]

Multi-task learning from fixed-wing uav images for 2d/3d city modeling.arXiv preprint arXiv:2109.00918,

Mohammad R Bayanlou and Mehdi Khoshboresh- Masouleh. Multi-task learning from fixed-wing uav images for 2d/3d city modeling.arXiv preprint arXiv:2109.00918,

work page arXiv
[4]

Curriculum learning

Yoshua Bengio, J ´erˆome Louradour, Ronan Collobert, and Ja- son Weston. Curriculum learning. InProceedings of the 26th annual international conference on machine learning, pages 41–48, 2009. 6

work page 2009
[5]

Agmtr: Agent mining transformer for few-shot segmentation in remote sensing.International Journal of Computer Vision, 133(4): 1780–1807, 2025

Hanbo Bi, Yingchao Feng, Yongqiang Mao, Jianning Pei, Wenhui Diao, Hongqi Wang, and Xian Sun. Agmtr: Agent mining transformer for few-shot segmentation in remote sensing.International Journal of Computer Vision, 133(4): 1780–1807, 2025. 1

work page 2025
[6]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open- source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Semi- supervised semantic segmentation in earth observation: The minifrance suite, dataset analysis and multi-task network study.Machine Learning, 111(9):3125–3160, 2022

Javiera Castillo-Navarro, Bertrand Le Saux, Alexandre Boulch, Nicolas Audebert, and S ´ebastien Lef `evre. Semi- supervised semantic segmentation in earth observation: The minifrance suite, dataset analysis and multi-task network study.Machine Learning, 111(9):3125–3160, 2022. 3, 1, 8

work page 2022
[9]

Rsrefseg: Referring remote sensing im- age segmentation with foundation models.arXiv preprint arXiv:2501.06809, 2025

Keyan Chen, Jiafan Zhang, Chenyang Liu, Zhengxia Zou, and Zhenwei Shi. Rsrefseg: Referring remote sensing im- age segmentation with foundation models.arXiv preprint arXiv:2501.06809, 2025. 1

work page arXiv 2025
[10]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 10

work page 2024
[11]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 6

work page 2022
[12]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 10

work page 2023
[13]

Functional map of the world

Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018. 3

work page 2018
[14]

Deepglobe 2018: A challenge to parse the earth through satellite images

Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 172–181, 2018. 3, 1, 8

work page 2018
[15]

Cross-Modal Bidi- rectional Interaction Model for Referring Remote Sensing Image Segmentation, 2025

Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, and Yanfeng Gu. Cross-modal bidirectional interaction model for referring remote sensing image segmentation.arXiv preprint arXiv:2410.08613, 2024. 3

work page arXiv 2024
[16]

Visdrone-det2019: The vision meets drone ob- ject detection in image challenge results

Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, et al. Visdrone-det2019: The vision meets drone ob- ject detection in image challenge results. InProceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019. 3

work page 2019
[17]

Enrich distill and fuse: Generalized few-shot semantic seg- mentation in remote sensing leveraging foundation model’s assistance

Tianyi Gao, Wei Ao, Xing-ao Wang, Yuanhao Zhao, Ping Ma, Mengjie Xie, Hang Fu, Jinchang Ren, and Zhi Gao. Enrich distill and fuse: Generalized few-shot semantic seg- mentation in remote sensing leveraging foundation model’s assistance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2771– 2780, 2024. 1

work page 2024
[18]

Flair: a country-scale land cover semantic segmen- tation dataset from multi-source optical imagery.Advances in Neural Information Processing Systems, 36:16456–16482,

Anatol Garioud, Nicolas Gonthier, Loic Landrieu, Apolline De Wit, Marion Valette, Marc Poup ´ee, S´ebastien Giordano, et al. Flair: a country-scale land cover semantic segmen- tation dataset from multi-source optical imagery.Advances in Neural Information Processing Systems, 36:16456–16482,

work page
[19]

Seg- mentation from natural language expressions

Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Seg- mentation from natural language expressions. InEuropean conference on computer vision, pages 108–124. Springer,

work page
[20]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 2

work page 2025
[21]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Segment anything in high quality

Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. InNeurIPS, 2023. 8

work page 2023
[23]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- 9 head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 1, 2

work page 2023
[24]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831– 27840, 2024. 2, 8

work page 2024
[25]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. 2, 4, 6, 7, 8, 10

work page 2024
[26]

Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogram- metry and remote sensing, 159:296–307, 2020

Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogram- metry and remote sensing, 159:296–307, 2020. 3, 2, 8

work page 2020
[27]

Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images

Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 10545–10556, 2025. 2

work page 2025
[28]

Rsvg-zeroov: Exploring a training-free framework for zero- shot open-vocabulary visual grounding in remote sensing im- ages.arXiv preprint arXiv:2509.18711, 2025

Ke Li, Di Wang, Ting Wang, Fuyu Dong, Yiming Zhang, Luyao Zhang, Xiangyu Wang, Shaofeng Li, and Quan Wang. Rsvg-zeroov: Exploring a training-free framework for zero- shot open-vocabulary visual grounding in remote sensing im- ages.arXiv preprint arXiv:2509.18711, 2025. 8

work page arXiv 2025
[29]

SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model, 2025

Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangy- ong Cao. Segearth-r1: Geospatial pixel reasoning via large language model.arXiv preprint arXiv:2504.09644, 2025. 1, 2, 3, 4, 5, 6, 7, 8, 10, 11

work page arXiv 2025
[30]

Referring image seg- mentation via recurrent refinement networks

Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image seg- mentation via recurrent refinement networks. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2018. 2

work page 2018
[31]

Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, S ´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report.arXiv preprint arXiv:2309.05463, 2023. 6, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024. 10

work page 2024
[33]

Rotated multi-scale interaction network for referring remote sensing image seg- mentation

Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Ji- ayi Ji, Xiaoshuai Sun, and Rongrong Ji. Rotated multi-scale interaction network for referring remote sensing image seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26658– 26668, 2024. 1, 2, 3, 4, 8, 10, 11

work page 2024
[34]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 6, 9, 10

work page 2021
[35]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

Rrsecs: Referring remote sensing expres- sion comprehension and segmentation.IEEE Geoscience and Remote Sensing Magazine, 2025

Xiaoqiang Lu, Long Sun, Lingling Li, Licheng Jiao, Yuting Yang, Zhongjian Huang, Jinming Chai, Xu Liu, Fang Liu, Wenping Ma, et al. Rrsecs: Referring remote sensing expres- sion comprehension and segmentation.IEEE Geoscience and Remote Sensing Magazine, 2025. 1, 2

work page 2025
[37]

Skysensegpt: A fine-grained in- struction tuning dataset and model for remote sensing vision- language understanding.arXiv preprint arXiv:2406.10100,

Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, et al. Skysensegpt: A fine-grained in- struction tuning dataset and model for remote sensing vision- language understanding.arXiv preprint arXiv:2406.10100,

work page arXiv
[38]

Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024. 2

work page 2024
[39]

Safari: Adaptive s equence tr a ns f ormer for we a kly su- pervised r eferring expression segmentat i on

Sayan Nag, Koustava Goswami, and Srikrishna Karanam. Safari: Adaptive s equence tr a ns f ormer for we a kly su- pervised r eferring expression segmentat i on. InEuropean Conference on Computer Vision, pages 485–503. Springer,

work page
[40]

Geopix: A multimodal large language model for pixel-level image understanding in remote sensing.IEEE Geoscience and Remote Sensing Magazine, 2025

Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, and Yu Liu. Geopix: A multimodal large language model for pixel-level image understanding in remote sensing.IEEE Geoscience and Remote Sensing Magazine, 2025. 2, 4, 6, 10

work page 2025
[41]

Locate anything on earth: Advancing open-vocabulary ob- ject detection for remote sensing community

Jiancheng Pan, Yanxing Liu, Yuqian Fu, Muyuan Ma, Jiahao Li, Danda Pani Paudel, Luc Van Gool, and Xiaomeng Huang. Locate anything on earth: Advancing open-vocabulary ob- ject detection for remote sensing community. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 6281–6289, 2025. 2

work page 2025
[42]

Lisat: Language- instructed segmentation assistant for satellite imagery.arXiv preprint arXiv:2505.02829, 2025

Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M Chan. Lisat: Language- instructed segmentation assistant for satellite imagery.arXiv preprint arXiv:2505.02829, 2025. 4, 6, 10

work page arXiv 2025
[43]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 10

work page 2021
[44]

Sam 2: Seg- ment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Seg- ment anything in images and videos. InThe Thirteenth In- ternational Conference on Learning Representations, 2024. 8

work page 2024
[45]

Pixellm: Pixel reasoning with large multimodal model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024. 4, 6, 7, 10

work page 2024
[46]

Large scale high-resolution land cover mapping with multi- resolution data

Caleb Robinson, Le Hou, Kolya Malkin, Rachel Soobit- sky, Jacob Czawlytko, Bistra Dilkina, and Nebojsa Jojic. Large scale high-resolution land cover mapping with multi- resolution data. InProceedings of the IEEE/CVF Conference 10 on Computer Vision and Pattern Recognition, pages 12726– 12735, 2019. 3, 1, 8

work page 2019
[47]

Advances in remote sensing and ai for vegetation monitoring in power line corridors: A re- view and future directions: A review and future directions

Antonis Savva, Christos Kyrkou, Panayiotis Kolios, and Theocharis Theocharides. Advances in remote sensing and ai for vegetation monitoring in power line corridors: A re- view and future directions: A review and future directions. IEEE Geoscience and Remote Sensing Magazine, 2025. 1

work page 2025
[48]

Geopixel: Pixel grounding large multimodal model in remote sens- ing

Akashah Shabbir, Mohammed Zumri, Mohammed Ben- namoun, Fahad Shahbaz Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sens- ing. InForty-second International Conference on Machine Learning, 2025. 2, 4, 6, 7, 10

work page 2025
[49]

Groundvlp: Harnessing zero-shot visual ground- ing from vision-language pre-training and open-vocabulary object detection

Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, and Jian- wei Yin. Groundvlp: Harnessing zero-shot visual ground- ing from vision-language pre-training and open-vocabulary object detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4766–4775, 2024. 8

work page 2024
[50]

Key-word-aware network for referring expression image seg- mentation

Hengcan Shi, Hongliang Li, Fanman Meng, and Qingbo Wu. Key-word-aware network for referring expression image seg- mentation. InProceedings of the European Conference on Computer Vision (ECCV), pages 38–54, 2018. 2

work page 2018
[51]

Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping.Journal of Remote Sensing, 3:0078, 2023

Qian Shi, Da He, Zhengyu Liu, Xiaoping Liu, and Jingqian Xue. Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping.Journal of Remote Sensing, 3:0078, 2023. 3, 1, 8

work page 2023
[52]

Earthmind: Towards multi-granular and multi- sensor earth observation with large multimodal models,

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, and Paolo Rota. Earth- mind: Towards multi-granular and multi-sensor earth ob- servation with large multimodal models.arXiv preprint arXiv:2506.01667, 2025. 4, 6, 10

work page arXiv 2025
[53]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fa- had Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025. 2

work page 2025
[54]

Fair1m: A benchmark dataset for fine- grained object recognition in high-resolution remote sens- ing imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022

Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, et al. Fair1m: A benchmark dataset for fine- grained object recognition in high-resolution remote sens- ing imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022. 3, 2, 8

work page 2022
[55]

Visual grounding in remote sensing images

Yuxi Sun, Shanshan Feng, Xutao Li, Yunming Ye, Jian Kang, and Xu Huang. Visual grounding in remote sensing images. InProceedings of the 30th ACM International con- ference on Multimedia, pages 404–412, 2022. 3

work page 2022
[56]

Land- cover classification with high-resolution remote sensing im- ages using transferable deep models.Remote Sensing of En- vironment, 2020

Xin-Yi Tong, Gui-Song Xia, Qikai Lu, Huangfeng Shen, Shengyang Li, Shucheng You, and Liangpei Zhang. Land- cover classification with high-resolution remote sensing im- ages using transferable deep models.Remote Sensing of En- vironment, 2020. 3, 1, 8

work page 2020
[57]

Enabling country-scale land cover mapping with meter-resolution satellite imagery.ISPRS Journal of Photogrammetry and Re- mote Sensing, 196:178–196, 2023

Xin-Yi Tong, Gui-Song Xia, and Xiao Xiang Zhu. Enabling country-scale land cover mapping with meter-resolution satellite imagery.ISPRS Journal of Photogrammetry and Re- mote Sensing, 196:178–196, 2023. 3, 1, 8

work page 2023
[58]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Ov-vg: A benchmark for open-vocabulary visual grounding.Neurocomputing, 591:127738, 2024

Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, and Qi Zhao. Ov-vg: A benchmark for open-vocabulary visual grounding.Neurocomputing, 591:127738, 2024. 8

work page 2024
[60]

Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model.Advances in Neural Information Processing Systems, 36:8815–8827, 2023

Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model.Advances in Neural Information Processing Systems, 36:8815–8827, 2023. 1, 6, 2, 8

work page 2023
[61]

Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336,

work page
[62]

Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021

Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021. 3, 1, 8

work page arXiv 2021
[63]

Diffusion model is secretly a training-free open vocabulary semantic segmenter

Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. IEEE Transactions on Image Processing, 2025. 8

work page 2025
[64]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Holitracer: Holistic vectorization of geographic objects from large-size remote sensing imagery

Yu Wang, Bo Dang, Wanchun Li, Wei Chen, and Yansheng Li. Holitracer: Holistic vectorization of geographic objects from large-size remote sensing imagery. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 8482–8491, 2025. 1

work page 2025
[66]

Toward robust referring image seg- mentation.IEEE Transactions on Image Processing, 33: 1782–1794, 2024

Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, and Dacheng Tao. Toward robust referring image seg- mentation.IEEE Transactions on Image Processing, 33: 1782–1794, 2024. 2

work page 2024
[67]

Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017. 3

work page 2017
[68]

Dota: A large-scale dataset for object detection in aerial images

Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Be- longie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liang- pei Zhang. Dota: A large-scale dataset for object detection in aerial images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983,

work page
[69]

Exploring phrase- level grounding with text-to-image diffusion model

Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, and Rongrong Ji. Exploring phrase- level grounding with text-to-image diffusion model. InEu- ropean Conference on Computer Vision, pages 161–180. Springer, 2024. 8 11

work page 2024
[70]

Lavt: Language-aware vision transformer for referring image segmentation

Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Heng- shuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022. 2

work page 2022
[71]

Remotesam: Towards segment anything for earth observa- tion

Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. Remotesam: Towards segment anything for earth observa- tion. InProceedings of the 33rd ACM International Confer- ence on Multimedia, pages 3027–3036, 2025. 3, 4, 6, 7, 8, 2, 11

work page 2025
[72]

Remotereasoner: Towards unifying geospatial reasoning workflow.arXiv preprint arXiv:2507.19280, 2025

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow.arXiv preprint arXiv:2507.19280, 2025. 2, 7

work page arXiv 2025
[73]

Modeling context in referring expres- sions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expres- sions. InEuropean conference on computer vision, pages 69–85. Springer, 2016. 6

work page 2016
[74]

Rrsis: Referring remote sensing image segmentation.arXiv preprint arXiv:2306.08625, 2023

Zhenghang Yuan, Lichao Mou, Yuansheng Hua, and Xiao Xiang Zhu. Rrsis: Referring remote sensing image segmentation.arXiv preprint arXiv:2306.08625, 2023. 2, 3

work page arXiv 2023
[75]

Rsvg: Exploring data and models for visual grounding on remote sensing data

Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61: 1–13, 2023. 2, 6, 8

work page 2023
[76]

Next-chat: An lmm for chat, detection and segmen- tation

Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmen- tation. InInternational Conference on Machine Learning, pages 60116–60133. PMLR, 2024. 7

work page 2024
[77]

Earthmarker: A visual prompting multi- modal large language model for remote sensing.IEEE Trans- actions on Geoscience and Remote Sensing, 2024

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, Jun Li, and Xuerui Mao. Earthmarker: A visual prompting multi- modal large language model for remote sensing.IEEE Trans- actions on Geoscience and Remote Sensing, 2024. 1

work page 2024
[78]

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multimodal large lan- guage model for multisensor image comprehension in re- mote sensing domain.IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024. 2

work page 2024
[79]

Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection

Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing, 57 (8):5535–5548, 2019. 3

work page 2019
[80]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 2, 4, 6, 7, 8, 10

work page 2024

Showing first 80 references.

[1] [1]

Generalizable disaster damage assessment via change detection with vision foun- dation model

Kyeongjin Ahn, Sungwon Han, Sungwon Park, Jihee Kim, Sangyoon Park, and Meeyoung Cha. Generalizable disaster damage assessment via change detection with vision foun- dation model. InProceedings of the AAAI Conference on Artificial Intelligence, pages 27784–27792, 2025. 1

work page 2025

[2] [2]

Skyscapes fine-grained se- mantic understanding of aerial scenes

Seyed Majid Azimi, Corentin Henry, Lars Sommer, Arne Schumann, and Eleonora Vig. Skyscapes fine-grained se- mantic understanding of aerial scenes. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 7393–7403, 2019. 3

work page 2019

[3] [3]

Multi-task learning from fixed-wing uav images for 2d/3d city modeling.arXiv preprint arXiv:2109.00918,

Mohammad R Bayanlou and Mehdi Khoshboresh- Masouleh. Multi-task learning from fixed-wing uav images for 2d/3d city modeling.arXiv preprint arXiv:2109.00918,

work page arXiv

[4] [4]

Curriculum learning

Yoshua Bengio, J ´erˆome Louradour, Ronan Collobert, and Ja- son Weston. Curriculum learning. InProceedings of the 26th annual international conference on machine learning, pages 41–48, 2009. 6

work page 2009

[5] [5]

Agmtr: Agent mining transformer for few-shot segmentation in remote sensing.International Journal of Computer Vision, 133(4): 1780–1807, 2025

Hanbo Bi, Yingchao Feng, Yongqiang Mao, Jianning Pei, Wenhui Diao, Hongqi Wang, and Xian Sun. Agmtr: Agent mining transformer for few-shot segmentation in remote sensing.International Journal of Computer Vision, 133(4): 1780–1807, 2025. 1

work page 2025

[6] [6]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open- source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Semi- supervised semantic segmentation in earth observation: The minifrance suite, dataset analysis and multi-task network study.Machine Learning, 111(9):3125–3160, 2022

Javiera Castillo-Navarro, Bertrand Le Saux, Alexandre Boulch, Nicolas Audebert, and S ´ebastien Lef `evre. Semi- supervised semantic segmentation in earth observation: The minifrance suite, dataset analysis and multi-task network study.Machine Learning, 111(9):3125–3160, 2022. 3, 1, 8

work page 2022

[9] [9]

Rsrefseg: Referring remote sensing im- age segmentation with foundation models.arXiv preprint arXiv:2501.06809, 2025

Keyan Chen, Jiafan Zhang, Chenyang Liu, Zhengxia Zou, and Zhenwei Shi. Rsrefseg: Referring remote sensing im- age segmentation with foundation models.arXiv preprint arXiv:2501.06809, 2025. 1

work page arXiv 2025

[10] [10]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 10

work page 2024

[11] [11]

Masked-attention mask transformer for universal image segmentation

Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 6

work page 2022

[12] [12]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 10

work page 2023

[13] [13]

Functional map of the world

Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018. 3

work page 2018

[14] [14]

Deepglobe 2018: A challenge to parse the earth through satellite images

Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 172–181, 2018. 3, 1, 8

work page 2018

[15] [15]

Cross-Modal Bidi- rectional Interaction Model for Referring Remote Sensing Image Segmentation, 2025

Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, and Yanfeng Gu. Cross-modal bidirectional interaction model for referring remote sensing image segmentation.arXiv preprint arXiv:2410.08613, 2024. 3

work page arXiv 2024

[16] [16]

Visdrone-det2019: The vision meets drone ob- ject detection in image challenge results

Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, Qinghua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, et al. Visdrone-det2019: The vision meets drone ob- ject detection in image challenge results. InProceedings of the IEEE/CVF international conference on computer vision workshops, pages 0–0, 2019. 3

work page 2019

[17] [17]

Enrich distill and fuse: Generalized few-shot semantic seg- mentation in remote sensing leveraging foundation model’s assistance

Tianyi Gao, Wei Ao, Xing-ao Wang, Yuanhao Zhao, Ping Ma, Mengjie Xie, Hang Fu, Jinchang Ren, and Zhi Gao. Enrich distill and fuse: Generalized few-shot semantic seg- mentation in remote sensing leveraging foundation model’s assistance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2771– 2780, 2024. 1

work page 2024

[18] [18]

Flair: a country-scale land cover semantic segmen- tation dataset from multi-source optical imagery.Advances in Neural Information Processing Systems, 36:16456–16482,

Anatol Garioud, Nicolas Gonthier, Loic Landrieu, Apolline De Wit, Marion Valette, Marc Poup ´ee, S´ebastien Giordano, et al. Flair: a country-scale land cover semantic segmen- tation dataset from multi-source optical imagery.Advances in Neural Information Processing Systems, 36:16456–16482,

work page

[19] [19]

Seg- mentation from natural language expressions

Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Seg- mentation from natural language expressions. InEuropean conference on computer vision, pages 108–124. Springer,

work page

[20] [20]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 2

work page 2025

[21] [21]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Segment anything in high quality

Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. InNeurIPS, 2023. 8

work page 2023

[23] [23]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- 9 head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 1, 2

work page 2023

[24] [24]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831– 27840, 2024. 2, 8

work page 2024

[25] [25]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. 2, 4, 6, 7, 8, 10

work page 2024

[26] [26]

Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogram- metry and remote sensing, 159:296–307, 2020

Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogram- metry and remote sensing, 159:296–307, 2020. 3, 2, 8

work page 2020

[27] [27]

Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images

Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 10545–10556, 2025. 2

work page 2025

[28] [28]

Rsvg-zeroov: Exploring a training-free framework for zero- shot open-vocabulary visual grounding in remote sensing im- ages.arXiv preprint arXiv:2509.18711, 2025

Ke Li, Di Wang, Ting Wang, Fuyu Dong, Yiming Zhang, Luyao Zhang, Xiangyu Wang, Shaofeng Li, and Quan Wang. Rsvg-zeroov: Exploring a training-free framework for zero- shot open-vocabulary visual grounding in remote sensing im- ages.arXiv preprint arXiv:2509.18711, 2025. 8

work page arXiv 2025

[29] [29]

SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model, 2025

Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangy- ong Cao. Segearth-r1: Geospatial pixel reasoning via large language model.arXiv preprint arXiv:2504.09644, 2025. 1, 2, 3, 4, 5, 6, 7, 8, 10, 11

work page arXiv 2025

[30] [30]

Referring image seg- mentation via recurrent refinement networks

Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image seg- mentation via recurrent refinement networks. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2018. 2

work page 2018

[31] [31]

Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, S ´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report.arXiv preprint arXiv:2309.05463, 2023. 6, 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024. 10

work page 2024

[33] [33]

Rotated multi-scale interaction network for referring remote sensing image seg- mentation

Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Ji- ayi Ji, Xiaoshuai Sun, and Rongrong Ji. Rotated multi-scale interaction network for referring remote sensing image seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26658– 26668, 2024. 1, 2, 3, 4, 8, 10, 11

work page 2024

[34] [34]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 6, 9, 10

work page 2021

[35] [35]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [36]

Rrsecs: Referring remote sensing expres- sion comprehension and segmentation.IEEE Geoscience and Remote Sensing Magazine, 2025

Xiaoqiang Lu, Long Sun, Lingling Li, Licheng Jiao, Yuting Yang, Zhongjian Huang, Jinming Chai, Xu Liu, Fang Liu, Wenping Ma, et al. Rrsecs: Referring remote sensing expres- sion comprehension and segmentation.IEEE Geoscience and Remote Sensing Magazine, 2025. 1, 2

work page 2025

[37] [37]

Skysensegpt: A fine-grained in- struction tuning dataset and model for remote sensing vision- language understanding.arXiv preprint arXiv:2406.10100,

Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, et al. Skysensegpt: A fine-grained in- struction tuning dataset and model for remote sensing vision- language understanding.arXiv preprint arXiv:2406.10100,

work page arXiv

[38] [38]

Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024. 2

work page 2024

[39] [39]

Safari: Adaptive s equence tr a ns f ormer for we a kly su- pervised r eferring expression segmentat i on

Sayan Nag, Koustava Goswami, and Srikrishna Karanam. Safari: Adaptive s equence tr a ns f ormer for we a kly su- pervised r eferring expression segmentat i on. InEuropean Conference on Computer Vision, pages 485–503. Springer,

work page

[40] [40]

Geopix: A multimodal large language model for pixel-level image understanding in remote sensing.IEEE Geoscience and Remote Sensing Magazine, 2025

Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, and Yu Liu. Geopix: A multimodal large language model for pixel-level image understanding in remote sensing.IEEE Geoscience and Remote Sensing Magazine, 2025. 2, 4, 6, 10

work page 2025

[41] [41]

Locate anything on earth: Advancing open-vocabulary ob- ject detection for remote sensing community

Jiancheng Pan, Yanxing Liu, Yuqian Fu, Muyuan Ma, Jiahao Li, Danda Pani Paudel, Luc Van Gool, and Xiaomeng Huang. Locate anything on earth: Advancing open-vocabulary ob- ject detection for remote sensing community. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 6281–6289, 2025. 2

work page 2025

[42] [42]

Lisat: Language- instructed segmentation assistant for satellite imagery.arXiv preprint arXiv:2505.02829, 2025

Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M Chan. Lisat: Language- instructed segmentation assistant for satellite imagery.arXiv preprint arXiv:2505.02829, 2025. 4, 6, 10

work page arXiv 2025

[43] [43]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 10

work page 2021

[44] [44]

Sam 2: Seg- ment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Seg- ment anything in images and videos. InThe Thirteenth In- ternational Conference on Learning Representations, 2024. 8

work page 2024

[45] [45]

Pixellm: Pixel reasoning with large multimodal model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024. 4, 6, 7, 10

work page 2024

[46] [46]

Large scale high-resolution land cover mapping with multi- resolution data

Caleb Robinson, Le Hou, Kolya Malkin, Rachel Soobit- sky, Jacob Czawlytko, Bistra Dilkina, and Nebojsa Jojic. Large scale high-resolution land cover mapping with multi- resolution data. InProceedings of the IEEE/CVF Conference 10 on Computer Vision and Pattern Recognition, pages 12726– 12735, 2019. 3, 1, 8

work page 2019

[47] [47]

Advances in remote sensing and ai for vegetation monitoring in power line corridors: A re- view and future directions: A review and future directions

Antonis Savva, Christos Kyrkou, Panayiotis Kolios, and Theocharis Theocharides. Advances in remote sensing and ai for vegetation monitoring in power line corridors: A re- view and future directions: A review and future directions. IEEE Geoscience and Remote Sensing Magazine, 2025. 1

work page 2025

[48] [48]

Geopixel: Pixel grounding large multimodal model in remote sens- ing

Akashah Shabbir, Mohammed Zumri, Mohammed Ben- namoun, Fahad Shahbaz Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sens- ing. InForty-second International Conference on Machine Learning, 2025. 2, 4, 6, 7, 10

work page 2025

[49] [49]

Groundvlp: Harnessing zero-shot visual ground- ing from vision-language pre-training and open-vocabulary object detection

Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, and Jian- wei Yin. Groundvlp: Harnessing zero-shot visual ground- ing from vision-language pre-training and open-vocabulary object detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4766–4775, 2024. 8

work page 2024

[50] [50]

Key-word-aware network for referring expression image seg- mentation

Hengcan Shi, Hongliang Li, Fanman Meng, and Qingbo Wu. Key-word-aware network for referring expression image seg- mentation. InProceedings of the European Conference on Computer Vision (ECCV), pages 38–54, 2018. 2

work page 2018

[51] [51]

Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping.Journal of Remote Sensing, 3:0078, 2023

Qian Shi, Da He, Zhengyu Liu, Xiaoping Liu, and Jingqian Xue. Globe230k: A benchmark dense-pixel annotation dataset for global land cover mapping.Journal of Remote Sensing, 3:0078, 2023. 3, 1, 8

work page 2023

[52] [52]

Earthmind: Towards multi-granular and multi- sensor earth observation with large multimodal models,

Yan Shu, Bin Ren, Zhitong Xiong, Danda Pani Paudel, Luc Van Gool, Begum Demir, Nicu Sebe, and Paolo Rota. Earth- mind: Towards multi-granular and multi-sensor earth ob- servation with large multimodal models.arXiv preprint arXiv:2506.01667, 2025. 4, 6, 10

work page arXiv 2025

[53] [53]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fa- had Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025. 2

work page 2025

[54] [54]

Fair1m: A benchmark dataset for fine- grained object recognition in high-resolution remote sens- ing imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022

Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, et al. Fair1m: A benchmark dataset for fine- grained object recognition in high-resolution remote sens- ing imagery.ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022. 3, 2, 8

work page 2022

[55] [55]

Visual grounding in remote sensing images

Yuxi Sun, Shanshan Feng, Xutao Li, Yunming Ye, Jian Kang, and Xu Huang. Visual grounding in remote sensing images. InProceedings of the 30th ACM International con- ference on Multimedia, pages 404–412, 2022. 3

work page 2022

[56] [56]

Land- cover classification with high-resolution remote sensing im- ages using transferable deep models.Remote Sensing of En- vironment, 2020

Xin-Yi Tong, Gui-Song Xia, Qikai Lu, Huangfeng Shen, Shengyang Li, Shucheng You, and Liangpei Zhang. Land- cover classification with high-resolution remote sensing im- ages using transferable deep models.Remote Sensing of En- vironment, 2020. 3, 1, 8

work page 2020

[57] [57]

Enabling country-scale land cover mapping with meter-resolution satellite imagery.ISPRS Journal of Photogrammetry and Re- mote Sensing, 196:178–196, 2023

Xin-Yi Tong, Gui-Song Xia, and Xiao Xiang Zhu. Enabling country-scale land cover mapping with meter-resolution satellite imagery.ISPRS Journal of Photogrammetry and Re- mote Sensing, 196:178–196, 2023. 3, 1, 8

work page 2023

[58] [58]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Ov-vg: A benchmark for open-vocabulary visual grounding.Neurocomputing, 591:127738, 2024

Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, and Qi Zhao. Ov-vg: A benchmark for open-vocabulary visual grounding.Neurocomputing, 591:127738, 2024. 8

work page 2024

[60] [60]

Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model.Advances in Neural Information Processing Systems, 36:8815–8827, 2023

Di Wang, Jing Zhang, Bo Du, Minqiang Xu, Lin Liu, Dacheng Tao, and Liangpei Zhang. Samrs: Scaling-up re- mote sensing segmentation dataset with segment anything model.Advances in Neural Information Processing Systems, 36:8815–8827, 2023. 1, 6, 2, 8

work page 2023

[61] [61]

Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336,

work page

[62] [62]

Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021

Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021. 3, 1, 8

work page arXiv 2021

[63] [63]

Diffusion model is secretly a training-free open vocabulary semantic segmenter

Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. IEEE Transactions on Image Processing, 2025. 8

work page 2025

[64] [64]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Holitracer: Holistic vectorization of geographic objects from large-size remote sensing imagery

Yu Wang, Bo Dang, Wanchun Li, Wei Chen, and Yansheng Li. Holitracer: Holistic vectorization of geographic objects from large-size remote sensing imagery. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 8482–8491, 2025. 1

work page 2025

[66] [66]

Toward robust referring image seg- mentation.IEEE Transactions on Image Processing, 33: 1782–1794, 2024

Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, and Dacheng Tao. Toward robust referring image seg- mentation.IEEE Transactions on Image Processing, 33: 1782–1794, 2024. 2

work page 2024

[67] [67]

Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017. 3

work page 2017

[68] [68]

Dota: A large-scale dataset for object detection in aerial images

Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Be- longie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liang- pei Zhang. Dota: A large-scale dataset for object detection in aerial images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983,

work page

[69] [69]

Exploring phrase- level grounding with text-to-image diffusion model

Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, and Rongrong Ji. Exploring phrase- level grounding with text-to-image diffusion model. InEu- ropean Conference on Computer Vision, pages 161–180. Springer, 2024. 8 11

work page 2024

[70] [70]

Lavt: Language-aware vision transformer for referring image segmentation

Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Heng- shuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022. 2

work page 2022

[71] [71]

Remotesam: Towards segment anything for earth observa- tion

Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. Remotesam: Towards segment anything for earth observa- tion. InProceedings of the 33rd ACM International Confer- ence on Multimedia, pages 3027–3036, 2025. 3, 4, 6, 7, 8, 2, 11

work page 2025

[72] [72]

Remotereasoner: Towards unifying geospatial reasoning workflow.arXiv preprint arXiv:2507.19280, 2025

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow.arXiv preprint arXiv:2507.19280, 2025. 2, 7

work page arXiv 2025

[73] [73]

Modeling context in referring expres- sions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expres- sions. InEuropean conference on computer vision, pages 69–85. Springer, 2016. 6

work page 2016

[74] [74]

Rrsis: Referring remote sensing image segmentation.arXiv preprint arXiv:2306.08625, 2023

Zhenghang Yuan, Lichao Mou, Yuansheng Hua, and Xiao Xiang Zhu. Rrsis: Referring remote sensing image segmentation.arXiv preprint arXiv:2306.08625, 2023. 2, 3

work page arXiv 2023

[75] [75]

Rsvg: Exploring data and models for visual grounding on remote sensing data

Yang Zhan, Zhitong Xiong, and Yuan Yuan. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 61: 1–13, 2023. 2, 6, 8

work page 2023

[76] [76]

Next-chat: An lmm for chat, detection and segmen- tation

Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmen- tation. InInternational Conference on Machine Learning, pages 60116–60133. PMLR, 2024. 7

work page 2024

[77] [77]

Earthmarker: A visual prompting multi- modal large language model for remote sensing.IEEE Trans- actions on Geoscience and Remote Sensing, 2024

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, Jun Li, and Xuerui Mao. Earthmarker: A visual prompting multi- modal large language model for remote sensing.IEEE Trans- actions on Geoscience and Remote Sensing, 2024. 1

work page 2024

[78] [78]

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multimodal large lan- guage model for multisensor image comprehension in re- mote sensing domain.IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024. 2

work page 2024

[79] [79]

Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection

Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing, 57 (8):5535–5548, 2019. 3

work page 2019

[80] [80]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2024. 2, 4, 6, 7, 8, 10

work page 2024