PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Pith reviewed 2026-05-10 09:16 UTC · model grok-4.3
The pith
UAV reasoning segmentation is formalized into spatial, attribute, and scene-level dimensions, with a 10k-image benchmark and a dual-path baseline model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task.
What carries the argument
PixDLM, a dual-path multimodal language model that combines a visual feature path with language reasoning to produce pixel-level segmentation masks from UAV images and complex queries.
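The review gives no implementation details for the dual-path design. As a rough, non-authoritative sketch (the function names, dimensions, and fusion scheme below are assumptions, not PixDLM's actual architecture), such a model typically aligns a language-path embedding with per-pixel visual-path features and reads the mask off their similarity map:

```python
# Minimal dual-path mask-prediction sketch (illustration only; all names
# and the fusion scheme are assumed, not taken from the paper).
# Visual path: an HxW grid of D-dim pixel features.
# Language path: one D2-dim embedding summarizing the query reasoning.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def mask_logits(pixel_feats, lang_embed, W_proj):
    """Project the language embedding into visual space, then score
    each pixel feature against it to get per-pixel mask logits."""
    seg_token = matvec(W_proj, lang_embed)  # align the two paths
    return [[dot(f, seg_token) for f in row] for row in pixel_feats]

# Toy example: 2x2 feature map, 2-dim features, identity projection.
feats = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [0.0, 0.0]]]
logits = mask_logits(feats, [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# Pixels whose visual feature aligns with the query score highest.
```

This mirrors the common pattern in reasoning-segmentation models (e.g. a LISA-style segmentation token dotted against decoder features); whether PixDLM fuses its paths this way is not stated in the reviewed text.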
If this is right
- The three-dimensional breakdown allows separate measurement of how well models handle location, properties, and overall scene meaning in aerial scenes.
- DRSeg supplies a shared testbed that makes it possible to compare future methods on identical UAV reasoning cases.
- PixDLM shows that a single multimodal model can address both language understanding and precise pixel output for drone imagery.
- Reported baseline numbers indicate that scale variation and oblique viewing angles make UAV imagery harder for current models than ground-level scenes.
Where Pith is reading between the lines
- If the three dimensions prove sufficient, they could guide the creation of specialized training data for drone systems used in search, mapping, or monitoring.
- Pairing the PixDLM approach with onboard UAV flight controls might enable real-time interpretive decisions during missions.
- The current dataset size suggests that scaling to more varied altitudes, weather, and urban-rural mixes would be a direct next test.
Load-bearing premise
That the three proposed dimensions fully capture the semantic requirements of UAV reasoning segmentation and that the 10k-image DRSeg dataset supplies a representative testbed for such models.
What would settle it
A new set of UAV images whose required reasoning falls outside the spatial-attribute-scene framework, on which models trained only on DRSeg show no gain over generic segmentation baselines.
Original abstract
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formally defines the UAV Reasoning Segmentation task and organizes its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. It constructs the DRSeg benchmark containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across the reasoning types, introduces PixDLM as a dual-path multimodal language model baseline for pixel-level reasoning segmentation, and reports experiments that establish strong baseline results while highlighting UAV-specific challenges.
Significance. If the empirical results hold, this work is significant for establishing a dedicated benchmark and initial model for reasoning segmentation in UAV imagery, which presents distinct challenges from ground-level scenes due to oblique viewpoints, ultra-high resolutions, and scale variations. The three-dimensional reasoning framework provides a useful organizing taxonomy, the release of DRSeg with CoT annotations offers a practical resource for developing interpretable multimodal models, and the baseline experiments can serve as a reference point for future research in remote-sensing applications.
major comments (1)
- Abstract: the claim that 'Experiments on DRSeg establish strong baseline results' is not supported by any quantitative metrics, ablation studies, or comparison tables, which is load-bearing for the central empirical contribution and prevents verification of the strength of PixDLM or the highlighted challenges.
minor comments (1)
- Abstract: the model name 'PixDLM' is introduced without expanding the acronym on first use.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the opportunity to improve the manuscript. We address the single major comment point by point below.
Point-by-point responses
Referee: Abstract: the claim that 'Experiments on DRSeg establish strong baseline results' is not supported by any quantitative metrics, ablation studies, or comparison tables, which is load-bearing for the central empirical contribution and prevents verification of the strength of PixDLM or the highlighted challenges.
Authors: We agree that the abstract claim would be stronger and more verifiable if accompanied by concrete quantitative highlights. In the revised version we will update the abstract to include key metrics (overall mIoU, per-reasoning-dimension scores, and brief comparison to the strongest baseline), while retaining the existing statement that the experiments highlight UAV-specific challenges. The full manuscript already contains the supporting tables, ablations, and comparisons in the Experiments section; the revision will simply surface the most salient numbers in the abstract for immediate reader assessment.
Revision: yes
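For readers unfamiliar with the metrics named in the exchange, here is a hedged sketch of how gIoU and cIoU are commonly computed in reasoning-segmentation work; the exact definitions used for DRSeg are an assumption here, not taken from the paper.

```python
# Illustrative gIoU/cIoU computation over predicted vs. ground-truth
# masks, represented as sets of (row, col) pixel coordinates.
# Definitions follow common reasoning-segmentation usage (assumed).

def iou(pred, gt):
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def g_iou(pairs):
    """gIoU: mean of per-image IoUs (every image weighted equally)."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

def c_iou(pairs):
    """cIoU: cumulative intersection over cumulative union
    (large masks dominate small ones)."""
    inter = sum(len(p & g) for p, g in pairs)
    union = sum(len(p | g) for p, g in pairs)
    return inter / union if union else 1.0

pairs = [
    ({(0, 0), (0, 1)}, {(0, 0), (0, 1)}),          # perfect: IoU 1.0
    ({(1, 0)}, {(1, 0), (1, 1), (2, 0), (2, 1)}),  # partial: IoU 0.25
]
# g_iou averages per image -> 0.625; c_iou pools pixels -> 0.5.
```

The gap between the two numbers on the toy pairs shows why benchmarks often report both: cIoU down-weights errors on small objects, which matters for UAV imagery with extreme scale variation.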
Circularity Check
No significant circularity; task definition and empirical baseline are self-contained
full rationale
The paper defines the UAV Reasoning Segmentation task by organizing semantic requirements into Spatial, Attribute, and Scene-level dimensions, releases the DRSeg benchmark (10k images with CoT QA pairs), and introduces PixDLM as a simple dual-path baseline model. No equations, fitted parameters, or derivation chains appear in the provided text. The three dimensions function as an explicit organizing taxonomy rather than a result derived from data or prior self-citations; the model is presented as a unified baseline whose experiments highlight challenges without claiming uniqueness theorems or reducing predictions to fitted inputs. All load-bearing elements (task formulation, dataset construction, baseline results) are independent of any internal self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard multimodal language model training paradigms apply to pixel-level reasoning segmentation.
invented entities (1)
- PixDLM (no independent evidence)
Reference graph
Works this paper leans on
- [1] Ilker Bozcan and Erdal Kayacan. AU-AIR: A multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance. In ICRA, pages 8504–8510, 2020.
- [2] Qinglong Cao, Yuntian Chen, Chao Ma, and Xiaokang Yang. 2025.
- [3] Kaiyan Chen, Ming Wu, Jiaming Liu, and Chuang Zhang. FGSD: A dataset for fine-grained ship detection in high resolution satellite images. CoRR, abs/2003.06832, 2020.
- [4] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. CoRR, abs/2306.15195, 2023.
- [5] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022.
- [6] Gong Cheng, Jiabao Wang, Ke Li, Xingxing Xie, Chunbo Lang, Yanqing Yao, and Junwei Han. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens., 60:1–11, 2022.
- [7] Gong Cheng, Xiang Yuan, Xiwen Yao, Kebing Yan, Qinghua Zeng, Xingxing Xie, and Junwei Han. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell., 45(11):13467–13488, 2023.
- [8] Dawei Du, Yuankai Qi, Hongyang Yu, Yi-Fan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In ECCV (10), pages 375–391, 2018.
- [9] Dawei Du, Yue Zhang, Liefeng Bo, Hailin Shi, Rui Zhu, Bo Han, Chunhui Zhang, et al. VisDrone-SOT2019: The vision meets drone single object tracking challenge results. 2019.
- [10] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
- [11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In ICCV, 2023.
- [12] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In CVPR, pages 9579–9589, 2024.
- [13] Haodong Li and Haicheng Qu. DASSF: Dynamic-attention scale-sequence fusion for aerial object detection. In CVM (1), pages 212–227, 2025.
- [14] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023.
- [15] Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 159:296–307, 2020.
- [16] Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangyong Cao. SegEarth-R1: Geospatial pixel reasoning via large language model. CoRR, abs/2504.09644, 2025.
- [17] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In EMNLP, pages 5971–5984, 2024.
- [18] Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, and Rongrong Ji. HRSeg: High-resolution visual perception and enhancement for reasoning segmentation. CoRR, abs/2507.12883, 2025.
- [19] Chang Liu, Xudong Jiang, and Henghui Ding. PrimitiveNet: Decomposing the global constraints for referring segmentation. Visual Intelligence, 2(1):16, 2024.
- [20] Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
- [21] Fan Liu, Liang Yao, Chuanyi Zhang, Ting Wu, Xinlei Zhang, Xiruo Jiang, and Jun Zhou. Boost UAV-based object detection via scale-invariant feature disentanglement and adversarial learning. IEEE Transactions on Geoscience and Remote Sensing, 2025.
- [22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024.
- [24] Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. AerialVLN: Vision-and-language navigation for UAVs. In ICCV, pages 15338–15348, 2023.
- [25] Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Jiayi Ji, Xiaoshuai Sun, and Rongrong Ji. Rotated multi-scale interaction network for referring remote sensing image segmentation. In CVPR, pages 26648–26658, 2024.
- [26] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. CoRR, abs/2503.06520, 2025.
- [27] Zikun Liu, Liu Yuan, Lubin Weng, and Yiping Yang. A high resolution optical satellite image dataset for ship recognition and some new baselines. In ICPRAM, pages 324–331, 2017.
- [28] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, et al. InternGPT: Solving vision-centric tasks by interacting with ChatGPT beyond language. arXiv preprint arXiv:2305.05662, 2023.
- [29] Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003, 2024.
- [30] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
- [31] Ingrid Nascimento, Pedro Castro, Sofia Klautau, Luan Gonçalves, Carnot Filho, Flávio Brito, Aldebaro Klautau, and Silvia Lins. Public dataset of parking lot videos for computational vision applied to surveillance. In ICMLA, pages 61–64, 2020.
- [32] Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, and Yu Liu. GeoPix: Multi-modal large language model for pixel-level image understanding in remote sensing. CoRR, abs/2501.06828, 2025.
- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- [34] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC, page 20, 2020.
- [35] Sébastien Razakarivony and Frédéric Jurie. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent., 34:187–203, 2016.
- [36] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. In CVPR, pages 26364–26373, 2024.
- [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
- [38]
- [39] Huiming Sun, Jiacheng Guo, Zibo Meng, Tianyun Zhang, Jianwu Fang, Yuewei Lin, and Hongkai Yu. EVD4UAV: An altitude-sensitive benchmark to evade vehicle detection in UAV. In IV, pages 545–552, 2024.
- [40] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol., 32(10):6700–6713, 2022.
- [41] Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Kelu Yao, Chao Li, and Xizhe Xue. Advancements in visual language models for remote sensing: Datasets, capabilities, and enhancement techniques. CoRR, abs/2410.17283, 2024.
- [42] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [43] Styliani Verykokou and Charalabos Ioannidis. Oblique aerial images: A review focusing on georeferencing procedures. International Journal of Remote Sensing, 39(11):3452–3496, 2018.
- [44] Chuanyun Wang, Yang Su, Jingjing Wang, Tian Wang, and Qian Gao. UAVSwarm dataset: An unmanned aerial vehicle swarm dataset for multiple object tracking. Remote Sens., 14(11):2601, 2022.
- [45] Junchi Wang and Lei Ke. LLM-Seg: Bridging image segmentation and large language model reasoning. In CVPR Workshops, pages 1765–1774, 2024.
- [46] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. CoRR, abs/2409.12191, 2024.
- [47] Xudong Wang, Shaolun Zhang, Shufan Li, Kehan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation with large language models. In ICLR, 2025.
- [48] Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, and Lin Ma. LaSagnA: Language-based segmentation assistant for complex queries. CoRR, abs/2404.08506, 2024.
- [49] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge J. Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. In CVPR, pages 3974–3983, 2018.
- [50] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized segmentation via multimodal large language models. CoRR, abs/2312.10103, 2023.
- [51] Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. CoRR, abs/2312.17240, 2023.
- [52] Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. RemoteSAM: Towards segment anything for earth observation. CoRR, abs/2505.18022, 2025.
- [53] Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. RemoteSAM: Towards segment anything for earth observation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 3027–3036, 2025.
- [54] Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. RemoteReasoner: Towards unifying geospatial reasoning workflow. arXiv preprint arXiv:2507.19280, 2025.
- [55] Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, and Liujuan Cao. RIS-LAD: A benchmark and model for referring low-altitude drone image segmentation. arXiv preprint arXiv:2507.20920, 2025.
- [56] Kai Ye, Haidi Tang, Bowen Liu, Pingyang Dai, Liujuan Cao, and Rongrong Ji. More clear, more flexible, more precise: A comprehensive oriented object detection benchmark for UAV. CoRR, abs/2504.20032, 2025.
- [57] Hongyang Yu, Guorong Li, Weigang Zhang, Qingming Huang, Dawei Du, Qi Tian, and Nicu Sebe. The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis., 128(5):1141–1159, 2020.
- [58] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In ECCV, pages 69–85. Springer, 2016.
- [59] Bo Zhang, Zuheng Ming, Yaqian Liu, Wei Feng, Liang He, and Kaixing Zhao. RSMMFormer: Multimodal transformer using multiscale self-attention for remote sensing image classification. In CICAI (1), pages 329–339, 2023.
- [60] Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In ICLR, 2024.
- [61] Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, and Hengshuang Zhao. ViLLa: Video reasoning segmentation with large language model. CoRR, abs/2407.14500, 2024.
- [62] Jian Zhou, Kai Feng, Weixing Li, Jun Han, and Feng Pan. TS4Net: Two-stage sample selective strategy for rotating object detection. Neurocomputing, 501:753–764, 2022.
- [63] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones: A challenge. CoRR, abs/1804.07437, 2018.
discussion (0)