PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Pith reviewed 2026-05-10 09:16 UTC · model grok-4.3
The pith
UAV reasoning segmentation is formalized into spatial, attribute, and scene-level dimensions, with a 10k-image benchmark and a dual-path baseline model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task.
What carries the argument
PixDLM, a dual-path multimodal language model that combines a visual feature path with language reasoning to produce pixel-level segmentation masks from UAV images and complex queries.
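The review gives no implementation details for the dual-path design. As a rough, non-authoritative sketch (the function names, dimensions, and fusion scheme below are assumptions, not PixDLM's actual architecture), such a model typically aligns a language-path embedding with per-pixel visual-path features and reads the mask off their similarity map:

```python
# Minimal dual-path mask-prediction sketch (illustration only; all names
# and the fusion scheme are assumed, not taken from the paper).
# Visual path: an HxW grid of D-dim pixel features.
# Language path: one D2-dim embedding summarizing the query reasoning.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def mask_logits(pixel_feats, lang_embed, W_proj):
    """Project the language embedding into visual space, then score
    each pixel feature against it to get per-pixel mask logits."""
    seg_token = matvec(W_proj, lang_embed)  # align the two paths
    return [[dot(f, seg_token) for f in row] for row in pixel_feats]

# Toy example: 2x2 feature map, 2-dim features, identity projection.
feats = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [0.0, 0.0]]]
logits = mask_logits(feats, [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# Pixels whose visual feature aligns with the query score highest.
```

This mirrors the common pattern in reasoning-segmentation models (e.g. a LISA-style segmentation token dotted against decoder features); whether PixDLM fuses its paths this way is not stated in the reviewed text.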
If this is right
- The three-dimensional breakdown allows separate measurement of how well models handle location, properties, and overall scene meaning in aerial scenes.
- DRSeg supplies a shared testbed that makes it possible to compare future methods on identical UAV reasoning cases.
- PixDLM shows that a single multimodal model can address both language understanding and precise pixel output for drone imagery.
- Reported baseline numbers indicate that scale variation and oblique viewing angles make UAV imagery harder for current models than ground-level scenes.
Where Pith is reading between the lines
- If the three dimensions prove sufficient, they could guide the creation of specialized training data for drone systems used in search, mapping, or monitoring.
- Pairing the PixDLM approach with onboard UAV flight controls might enable real-time interpretive decisions during missions.
- The current dataset size suggests that scaling to more varied altitudes, weather, and urban-rural mixes would be a direct next test.
Load-bearing premise
That the three proposed dimensions fully capture the semantic requirements of UAV reasoning segmentation and that the 10k-image DRSeg dataset supplies a representative testbed for such models.
What would settle it
A new set of UAV images whose required reasoning falls outside the spatial-attribute-scene framework, on which models trained only on DRSeg show no gain over generic segmentation baselines.
Original abstract
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formally defines the UAV Reasoning Segmentation task and organizes its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. It constructs the DRSeg benchmark containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across the reasoning types, introduces PixDLM as a dual-path multimodal language model baseline for pixel-level reasoning segmentation, and reports experiments that establish strong baseline results while highlighting UAV-specific challenges.
Significance. If the empirical results hold, this work is significant for establishing a dedicated benchmark and initial model for reasoning segmentation in UAV imagery, which presents distinct challenges from ground-level scenes due to oblique viewpoints, ultra-high resolutions, and scale variations. The three-dimensional reasoning framework provides a useful organizing taxonomy, the release of DRSeg with CoT annotations offers a practical resource for developing interpretable multimodal models, and the baseline experiments can serve as a reference point for future research in remote-sensing applications.
major comments (1)
- Abstract: the claim that 'Experiments on DRSeg establish strong baseline results' is not supported by any quantitative metrics, ablation studies, or comparison tables, which is load-bearing for the central empirical contribution and prevents verification of the strength of PixDLM or the highlighted challenges.
minor comments (1)
- Abstract: the model name 'PixDLM' is introduced without expanding the acronym on first use.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the opportunity to improve the manuscript. We address the single major comment point by point below.
Point-by-point responses
Referee: Abstract: the claim that 'Experiments on DRSeg establish strong baseline results' is not supported by any quantitative metrics, ablation studies, or comparison tables, which is load-bearing for the central empirical contribution and prevents verification of the strength of PixDLM or the highlighted challenges.
Authors: We agree that the abstract claim would be stronger and more verifiable if accompanied by concrete quantitative highlights. In the revised version we will update the abstract to include key metrics (overall mIoU, per-reasoning-dimension scores, and brief comparison to the strongest baseline), while retaining the existing statement that the experiments highlight UAV-specific challenges. The full manuscript already contains the supporting tables, ablations, and comparisons in the Experiments section; the revision will simply surface the most salient numbers in the abstract for immediate reader assessment.
Revision: yes
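For readers unfamiliar with the metrics named in the exchange, here is a hedged sketch of how gIoU and cIoU are commonly computed in reasoning-segmentation work; the exact definitions used for DRSeg are an assumption here, not taken from the paper.

```python
# Illustrative gIoU/cIoU computation over predicted vs. ground-truth
# masks, represented as sets of (row, col) pixel coordinates.
# Definitions follow common reasoning-segmentation usage (assumed).

def iou(pred, gt):
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

def g_iou(pairs):
    """gIoU: mean of per-image IoUs (every image weighted equally)."""
    return sum(iou(p, g) for p, g in pairs) / len(pairs)

def c_iou(pairs):
    """cIoU: cumulative intersection over cumulative union
    (large masks dominate small ones)."""
    inter = sum(len(p & g) for p, g in pairs)
    union = sum(len(p | g) for p, g in pairs)
    return inter / union if union else 1.0

pairs = [
    ({(0, 0), (0, 1)}, {(0, 0), (0, 1)}),          # perfect: IoU 1.0
    ({(1, 0)}, {(1, 0), (1, 1), (2, 0), (2, 1)}),  # partial: IoU 0.25
]
# g_iou averages per image -> 0.625; c_iou pools pixels -> 0.5.
```

The gap between the two numbers on the toy pairs shows why benchmarks often report both: cIoU down-weights errors on small objects, which matters for UAV imagery with extreme scale variation.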
Circularity Check
No significant circularity; task definition and empirical baseline are self-contained
full rationale
The paper defines the UAV Reasoning Segmentation task by organizing semantic requirements into Spatial, Attribute, and Scene-level dimensions, releases the DRSeg benchmark (10k images with CoT QA pairs), and introduces PixDLM as a simple dual-path baseline model. No equations, fitted parameters, or derivation chains appear in the provided text. The three dimensions function as an explicit organizing taxonomy rather than a result derived from data or prior self-citations; the model is presented as a unified baseline whose experiments highlight challenges without claiming uniqueness theorems or reducing predictions to fitted inputs. All load-bearing elements (task formulation, dataset construction, baseline results) are independent of any internal self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard multimodal language model training paradigms apply to pixel-level reasoning segmentation.
invented entities (1)
- PixDLM (no independent evidence)
Reference graph
Works this paper leans on
- [1] Ilker Bozcan and Erdal Kayacan. AU-AIR: A multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance. In ICRA, pages 8504–8510, 2020.
- [2] Qinglong Cao, Yuntian Chen, Chao Ma, and Xiaokang Yang. 2025.
- [3] Kaiyan Chen, Ming Wu, Jiaming Liu, and Chuang Zhang. FGSD: A dataset for fine-grained ship detection in high resolution satellite images. CoRR, abs/2003.06832, 2020.
- [4] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM's referential dialogue magic. CoRR, abs/2306.15195, 2023.
- [5] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022.
- [6] Gong Cheng, Jiabao Wang, Ke Li, Xingxing Xie, Chunbo Lang, Yanqing Yao, and Junwei Han. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens., 60:1–11, 2022.
- [7] Gong Cheng, Xiang Yuan, Xiwen Yao, Kebing Yan, Qinghua Zeng, Xingxing Xie, and Junwei Han. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell., 45(11):13467–13488, 2023.
- [8] Dawei Du, Yuankai Qi, Hongyang Yu, Yi-Fan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In ECCV (10), pages 375–391, 2018.
- [9] Dawei Du, Yue Zhang, Liefeng Bo, Hailin Shi, Rui Zhu, Bo Han, Chunhui Zhang, et al. VisDrone-SOT2019: The vision meets drone single object tracking challenge results. 2019.
- [10] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
- [11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In ICCV, 2023.
- [12] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In CVPR, pages 9579–9589, 2024.
- [13] Haodong Li and Haicheng Qu. DASSF: Dynamic-attention scale-sequence fusion for aerial object detection. In CVM (1), pages 212–227, 2025.
- [14] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023.
- [15] Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 159:296–307, 2020.
- [16] Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangyong Cao. SegEarth-R1: Geospatial pixel reasoning via large language model. CoRR, abs/2504.09644, 2025.
- [17] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In EMNLP, pages 5971–5984, 2024.
- [18] Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, and Rongrong Ji. HRSeg: High-resolution visual perception and enhancement for reasoning segmentation. CoRR, abs/2507.12883, 2025.
- [19] Chang Liu, Xudong Jiang, and Henghui Ding. PrimitiveNet: Decomposing the global constraints for referring segmentation. Visual Intelligence, 2(1):16, 2024.
- [20] Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
- [21] Fan Liu, Liang Yao, Chuanyi Zhang, Ting Wu, Xinlei Zhang, Xiruo Jiang, and Jun Zhou. Boost UAV-based object detection via scale-invariant feature disentanglement and adversarial learning. IEEE Transactions on Geoscience and Remote Sensing, 2025.
- [22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, pages 26296–26306, 2024.
- [24] Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. AerialVLN: Vision-and-language navigation for UAVs. In ICCV, pages 15338–15348, 2023.
- [25] Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Jiayi Ji, Xiaoshuai Sun, and Rongrong Ji. Rotated multi-scale interaction network for referring remote sensing image segmentation. In CVPR, pages 26648–26658, 2024.
- [26] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. CoRR, abs/2503.06520, 2025.
- [27] Zikun Liu, Liu Yuan, Lubin Weng, and Yiping Yang. A high resolution optical satellite image dataset for ship recognition and some new baselines. In ICPRAM, pages 324–331, 2017.
- [28] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, et al. InternGPT: Solving vision-centric tasks by interacting with ChatGPT beyond language. arXiv preprint arXiv:2305.05662, 2023.
- [29] Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, and Rongrong Ji. Feast your eyes: Mixture-of-resolution adaptation for multimodal large language models. arXiv preprint arXiv:2403.03003, 2024.
- [30] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
- [31] Ingrid Nascimento, Pedro Castro, Sofia Klautau, Luan Gonçalves, Carnot Filho, Flávio Brito, Aldebaro Klautau, and Silvia Lins. Public dataset of parking lot videos for computational vision applied to surveillance. In ICMLA, pages 61–64, 2020.
- [32] Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, and Yu Liu. GeoPix: Multi-modal large language model for pixel-level image understanding in remote sensing. CoRR, abs/2501.06828, 2025.
- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- [34] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC, page 20, 2020.
- [35] Sébastien Razakarivony and Frédéric Jurie. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent., 34:187–203, 2016.
- [36] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model. In CVPR, pages 26364–26373, 2024.
- [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
- [38]
- [39] Huiming Sun, Jiacheng Guo, Zibo Meng, Tianyun Zhang, Jianwu Fang, Yuewei Lin, and Hongkai Yu. EVD4UAV: An altitude-sensitive benchmark to evade vehicle detection in UAV. In IV, pages 545–552, 2024.
- [40] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol., 32(10):6700–6713, 2022.
- [41] Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Kelu Yao, Chao Li, and Xizhe Xue. Advancements in visual language models for remote sensing: Datasets, capabilities, and enhancement techniques. CoRR, abs/2410.17283, 2024.
- [42] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [43] Styliani Verykokou and Charalabos Ioannidis. Oblique aerial images: A review focusing on georeferencing procedures. International Journal of Remote Sensing, 39(11):3452–3496, 2018.
- [44] Chuanyun Wang, Yang Su, Jingjing Wang, Tian Wang, and Qian Gao. UAVSwarm dataset: An unmanned aerial vehicle swarm dataset for multiple object tracking. Remote Sens., 14(11):2601, 2022.
- [45] Junchi Wang and Lei Ke. LLM-Seg: Bridging image segmentation and large language model reasoning. In CVPR Workshops, pages 1765–1774, 2024.
- [46] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. CoRR, abs/2409.12191, 2024.
- [47] Xudong Wang, Shaolun Zhang, Shufan Li, Kehan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation with large language models. In ICLR, 2025.
- [48] Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, and Lin Ma. LaSagnA: Language-based segmentation assistant for complex queries. CoRR, abs/2404.08506, 2024.
- [49] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge J. Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. In CVPR, pages 3974–3983, 2018.
- [50] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized segmentation via multimodal large language models. CoRR, abs/2312.10103, 2023.
- [51] Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. CoRR, abs/2312.17240, 2023.
- [52] Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. RemoteSAM: Towards segment anything for earth observation. CoRR, abs/2505.18022, 2025.
- [53] Liang Yao, Fan Liu, Delong Chen, Chuanyi Zhang, Yijun Wang, Ziyun Chen, Wei Xu, Shimin Di, and Yuhui Zheng. RemoteSAM: Towards segment anything for earth observation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 3027–3036, 2025.
- [54] Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. RemoteReasoner: Towards unifying geospatial reasoning workflow. arXiv preprint arXiv:2507.19280, 2025.
- [55] Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, and Liujuan Cao. RIS-LAD: A benchmark and model for referring low-altitude drone image segmentation. arXiv preprint arXiv:2507.20920, 2025.
- [56] Kai Ye, Haidi Tang, Bowen Liu, Pingyang Dai, Liujuan Cao, and Rongrong Ji. More clear, more flexible, more precise: A comprehensive oriented object detection benchmark for UAV. CoRR, abs/2504.20032, 2025.
- [57] Hongyang Yu, Guorong Li, Weigang Zhang, Qingming Huang, Dawei Du, Qi Tian, and Nicu Sebe. The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis., 128(5):1141–1159, 2020.
- [58] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In ECCV, pages 69–85. Springer, 2016.
- [59] Bo Zhang, Zuheng Ming, Yaqian Liu, Wei Feng, Liang He, and Kaixing Zhao. RSMMFormer: Multimodal transformer using multiscale self-attention for remote sensing image classification. In CICAI (1), pages 329–339, 2023.
- [60] Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In ICLR, 2024.
- [61] Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, and Hengshuang Zhao. ViLLa: Video reasoning segmentation with large language model. CoRR, abs/2407.14500, 2024.
- [62] Jian Zhou, Kai Feng, Weixing Li, Jun Han, and Feng Pan. TS4Net: Two-stage sample selective strategy for rotating object detection. Neurocomputing, 501:753–764, 2022.
- [63] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones: A challenge. CoRR, abs/1804.07437, 2018.
discussion (0)