pith. sign in

arxiv: 2605.16179 · v1 · pith:ZZOAUXJVnew · submitted 2026-05-15 · 💻 cs.CV

MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models

Pith reviewed 2026-05-20 18:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords satellite image segmentationmultimodal large language modelssmallholder agricultureagricultural landscape mappinginstruction tuningdecoder-free segmentationhigh-resolution imagery
0
0 comments X

The pith

A new instruction format lets standard multimodal models segment fragmented smallholder farms in satellite images without extra decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that multimodal large language models can handle segmentation of complex agricultural landscapes in high-resolution satellite imagery using only text-based fine-tuning. It addresses the problems of fragmented plots, high intra-class variation, and scarce labels by introducing a data format that lets the model see the full image context but generate output tokens for just one patch at a time. This design removes the need for auxiliary vision decoders and sidesteps context-length limits while still producing accurate maps. If the method works as described, it turns existing multimodal models into practical tools for mapping smallholder agriculture across data-poor regions.

Core claim

MAgSeg demonstrates that standard multimodal large language models, when fine-tuned with a novel instruction tuning data format, can segment smallholder agricultural landscapes in high-resolution satellite imagery without auxiliary vision decoders by learning global image context while producing text tokens only for a local patch.

What carries the argument

The novel instruction tuning data format that supplies global image context but restricts token generation to one local patch per output.

If this is right

  • Standard multimodal models can now perform segmentation on high-resolution imagery without added vision components.
  • The approach scales fine-tuning to larger images by avoiding full-context token generation.
  • Evaluations across three countries show consistent gains over existing MLLM segmentation methods.
  • The method supplies a practical route to mapping fragmented smallholder environments with limited labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same patch-wise output trick could be tested on other remote-sensing tasks such as building detection or land-cover change.
  • If the format works for agriculture, it may reduce reliance on specialized decoder architectures in other fragmented-object domains.
  • One could check whether the method maintains performance when applied to multi-date image stacks rather than single scenes.

Load-bearing premise

The new instruction tuning format lets the model absorb full-image context while outputting tokens for only a local patch without any drop in segmentation accuracy.

What would settle it

Run MAgSeg on the same high-resolution satellite datasets used in the paper and check whether its segmentation accuracy on smallholder plots falls below that of decoder-equipped MLLM baselines.

Figures

Figures reproduced from arXiv: 2605.16179 by Aishwarya Jayagopal, Alok Talekar, Depanshu Sani, Piyush Tiwary, Sagar Gubbi, Subhashini Venugopalan, Utkarsh Ahuja, Vaibhav Rajan.

Figure 1
Figure 1. Figure 1: Overview of MAgSeg. Data Preparation: from each high-resolution satellite image xi and its segmentation map si, multiple patches pi and their corresponding masks ri are extracted, the masks are converted to a text-based RRLE representation ti to form the instruction tuning dataset: {Itext, xi, pi} → ti. Training: consists of two stages: (1) LoRA Supervised Finetuning (SFT), where the base multimodal LLM is… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative results comparing our approach MAgSeg with SOTA baselines. GT: Ground Truth. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Size-stratified performance analysis on the ALU dataset. [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Climatic Region stratified performance analysis on the [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Ecological Region stratified performance analysis on the [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of MAgSeg’s segmentation performance with and without GRPO post-training. Region of interests are [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results on India data. We compare our approach, MAgSeg against SOTA segmentation baselines. GT: Ground Truth. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative results on Cambodia data. We compare our approach, MAgSeg against SOTA segmentation baselines. GT: Ground [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative results on Vietnam data. We compare our approach, MAgSeg, against SOTA segmentation baselines. GT: Ground [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visualization of erroneous segmentation by MAgSeg. GT: Ground Truth. Region of interests are emphasized in red colored boxes. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
read the original abstract

Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents MAgSeg, a decoder-free segmentation method that adapts standard Multimodal Large Language Models to high-resolution satellite imagery for mapping fragmented smallholder agricultural landscapes in the Global South. It introduces a novel instruction-tuning data format that purportedly allows the model to internalize global image context while restricting text-token generation to a single local patch, thereby avoiding context-length bottlenecks and eliminating the need for auxiliary vision decoders. The central claim is that this architectural and data-format change yields significant performance gains over existing MLLM baselines on datasets spanning three countries.

Significance. If the core assumption holds, the work would be significant for computer vision and remote-sensing applications: it offers a scalable, decoder-free route to leverage existing MLLMs on high-resolution imagery without custom vision heads, potentially lowering the barrier for accurate mapping of complex, data-scarce agricultural environments. The emphasis on Global South smallholder landscapes also addresses an under-served domain.

major comments (3)
  1. [§3.2] §3.2 (Novel Instruction Tuning Data Format): The description of how global context is encoded in the prompt while restricting token generation to a local patch remains high-level; no concrete prompt templates, patch-sampling strategy, or mechanism for injecting global cues (e.g., downsampled overview tokens) are provided. Without these details it is impossible to assess whether the format truly preserves segmentation accuracy on fragmented plots with high intra-class variance.
  2. [§4] §4 (Experiments): The abstract and evaluation summary claim significant outperformance on three-country datasets, yet the manuscript supplies no quantitative metrics, error bars, ablation studies on the data format, or implementation details for the MLLM baselines. This absence prevents verification of the central claim that the new format delivers global context without accuracy loss.
  3. [§4.3] §4.3 (Ablation or Component Analysis): No ablation isolating the contribution of the novel data format versus standard instruction tuning is reported. Such an experiment is load-bearing for the claim that the format is the key enabler of decoder-free performance.
minor comments (2)
  1. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit annotation of the global-context injection path and the local-patch token generation boundary.
  2. [§2] The related-work section should include a brief comparison to recent decoder-free MLLM segmentation methods outside the agricultural domain to clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript describing MAgSeg. The comments highlight important areas where additional clarity and evidence would strengthen the presentation. We address each major comment point by point below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Novel Instruction Tuning Data Format): The description of how global context is encoded in the prompt while restricting token generation to a local patch remains high-level; no concrete prompt templates, patch-sampling strategy, or mechanism for injecting global cues (e.g., downsampled overview tokens) are provided. Without these details it is impossible to assess whether the format truly preserves segmentation accuracy on fragmented plots with high intra-class variance.

    Authors: We agree that the current description in §3.2 is high-level and would benefit from greater specificity. In the revised manuscript we will add concrete prompt templates, a detailed description of the patch-sampling strategy, and an explicit account of how global cues are injected (including the use of downsampled overview tokens). These additions will allow readers to evaluate whether the format maintains segmentation accuracy on fragmented plots with high intra-class variance. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and evaluation summary claim significant outperformance on three-country datasets, yet the manuscript supplies no quantitative metrics, error bars, ablation studies on the data format, or implementation details for the MLLM baselines. This absence prevents verification of the central claim that the new format delivers global context without accuracy loss.

    Authors: We acknowledge that the current version of §4 does not provide sufficient quantitative detail to fully verify the performance claims. We will expand this section to report specific quantitative metrics with error bars, implementation details for all MLLM baselines, and additional ablation results on the data format. These changes will substantiate the reported outperformance across the three-country datasets. revision: yes

  3. Referee: [§4.3] §4.3 (Ablation or Component Analysis): No ablation isolating the contribution of the novel data format versus standard instruction tuning is reported. Such an experiment is load-bearing for the claim that the format is the key enabler of decoder-free performance.

    Authors: We recognize that an ablation isolating the novel instruction-tuning format is essential to support the central claim. We will add a dedicated ablation study (or expand §4.3) that directly compares the proposed data format against standard instruction tuning, thereby demonstrating its specific contribution to decoder-free performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained architectural innovation

full rationale

The paper introduces MAgSeg as a decoder-free MLLM approach relying on a novel instruction tuning data format to handle global context with local patch token generation. This is presented as an empirical architectural and data-format contribution evaluated on multi-country datasets, without any equations, fitted parameters, or derivations that reduce to prior outputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the central claims rest on reported outperformance rather than re-expression of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions of multimodal model fine-tuning and the premise that the new data format preserves global context without explicit architectural changes.

axioms (1)
  • domain assumption Standard MLLM architectures can be instruction-tuned to output segmentation masks via text tokens when given appropriately formatted prompts.
    Invoked in the description of the novel instruction tuning data format.

pith-pipeline@v0.9.0 · 5760 in / 1204 out tokens · 38214 ms · 2026-05-20T18:47:00.651298+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    Subobject-level image tokenization

    [Chenet al., 2025 ] Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. InInternational Con- ference on Machine Learning,

  2. [2]

    Agricultural land- scape understanding at country-scale.arXiv,

    [Duaet al., 2024 ] Radhika Dua, Nikita Saxena, Aditi Agar- wal, Alex Wilson, Gaurav Singh, Hoang Tran, Ishan Deshpande, Amandeep Kaur, Gaurav Aggarwal, Chandan Nath, Arnab Basu, Vishal Batchu, Sharath Holla, Bindiya Kurle, Olana Missura, Rahul Aggarwal, Shubhika Garg, Nishi Shah, Avneet Singh, Dinesh Tewari, Agata Dondzik, Bharat Adsul, Milind Sohoni, Asi...

  3. [3]

    [FAOet al., 2023] FAO, IFAD, UNICEF, WFP, and WHO

    https: //arxiv.org/abs/2411.05359. [FAOet al., 2023] FAO, IFAD, UNICEF, WFP, and WHO. The state of food security and nutrition in the world

  4. [4]

    org/10.4060/cc3017en

    https://doi. org/10.4060/cc3017en. [Huet al., 2022 ] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

  5. [5]

    Kang and M.¨Ozdo˘gan

    [Kang and ¨Ozdo˘gan, 2019] Y . Kang and M.¨Ozdo˘gan. Field- level crop yield mapping with landsat using a hierarchical data assimilation approach.Remote Sensing of Environ- ment, 228:144–163,

  6. [6]

    Multi-region transfer learning for segmen- tation of crop field boundaries in satellite images with lim- ited labels

    [Kerneret al., 2023 ] Hannah Kerner, Saketh Sundar, and Mathan Satish. Multi-region transfer learning for segmen- tation of crop field boundaries in satellite images with lim- ited labels. InProceedings of the AAAI Workshop on AI to Accelerate Science and Engineering,

  7. [7]

    Segment anything

    [Kirillovet al., 2023 ] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InPro- ceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026,

  8. [8]

    Lisa: Reason- ing segmentation via large language model

    [Laiet al., 2024 ] Xin Lai, Zhuotao Tian, Yukang Chen, Yan- wei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reason- ing segmentation via large language model. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589,

  9. [9]

    Text4seg: Reimagining image segmentation as text generation

    [Lanet al., 2025 ] Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth Inter- national Conference on Learning Representations,

  10. [10]

    Lesiv, J.C

    [Lesivet al., 2019 ] M. Lesiv, J.C. Laso Bayas, L. See, M. Duerauer, D. Dahlia, N. Durando, R. Hazarika, P. Ku- mar Sahariah, M. Vakolyuk, and V . Blyshchyk. Estimating the global distribution of field size using crowdsourcing. Global Change Biology, 25:174–186,

  11. [11]

    Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66,

    [Liet al., 2024 ] Xiang Li, Congcong Wen, Yuan Hu, Zheng- hang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66,

  12. [12]

    Masoud, C

    [Masoudet al., 2020 ] K.M. Masoud, C. Persello, and V .A. Tolpekin. Delineation of agricultural field boundaries from sentinel-2 images using a novel super-resolution contour detector based on fully convolutional networks.Remote Sensing, 12(1):59,

  13. [13]

    Using deep learning and very- high-resolution imagery to map smallholder field bound- aries.Remote Sensing, 14(13):3046,

    [Meiet al., 2022 ] Weiye Mei, Haoyu Wang, David Fouhey, Weiqi Zhou, Isabella Hinks, Josh M Gray, Derek Van Berkel, and Meha Jain. Using deep learning and very- high-resolution imagery to map smallholder field bound- aries.Remote Sensing, 14(13):3046,

  14. [14]

    Categorisa- tion of farmers

    [Ministry of Agriculture and Farmers Welfare, 2024] Ministry of Agriculture and Farmers Welfare. Categorisa- tion of farmers. https://www.pib.gov.in/PressReleasePage. aspx?PRID=2085181,

  15. [15]

    OECD Publishing, Paris,

    [OECD, 2023] OECD.Agricultural Policy Monitoring and Evaluation 2023: Adapting Agriculture to Climate Change. OECD Publishing, Paris,

  16. [16]

    [Perselloet al., 2023 ] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and An- drew Nelson

    https://doi.org/ 10.1787/b14de474-en. [Perselloet al., 2023 ] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and An- drew Nelson. Ai4smallfarms: A dataset for crop field de- lineation in southeast asian smallholder farms.IEEE Geo- science and Remote Sensing Letters, 20:1–5,

  17. [17]

    [Quenumet al., 2025 ] Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M. Chan. LISAt: Language-instructed segmenta- tion assistant for satellite imagery. InThe Thirty-ninth An- nual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track,

  18. [18]

    Rada and K.O

    [Rada and Fuglie, 2019] N.E. Rada and K.O. Fuglie. New perspectives on farm size and productivity.Food Policy, 84:147–152,

  19. [19]

    Learning transferable visual models from nat- ural language supervision

    [Radfordet al., 2021 ] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from nat- ural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

  20. [20]

    Glamm: Pixel grounding large multimodal model

    [Rasheedet al., 2024 ] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming- Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018,

  21. [21]

    Mission critical–satellite data is a distinct modality in machine learning

    [Rolfet al., 2024 ] Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical–satellite data is a distinct modality in machine learning. InInterna- tional Conference on Learning Representations,

  22. [22]

    Agricultural intensifica- tion and changes in cultivated areas, 1970–2005.Proceed- ings of the national academy of sciences, 106(49):20675– 20680,

    [Rudelet al., 2009 ] Thomas K Rudel, Laura Schneider, Maria Uriarte, Billie Lee Turner, Ruth DeFries, Deborah Lawrence, Jacqueline Geoghegan, Susanna Hecht, Amy Ickowitz, Eric F Lambin, et al. Agricultural intensifica- tion and changes in cultivated areas, 1970–2005.Proceed- ings of the national academy of sciences, 106(49):20675– 20680,

  23. [23]

    Samberg, J.S

    [Samberget al., 2016 ] L.H. Samberg, J.S. Gerber, N. Ra- mankutty, M. Herrero, and P.C. West. Subnational distri- bution of average farm size and smallholder contributions to global food production.Environmental Research Let- ters, 11(12):124010,

  24. [24]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    [Shaoet al., 2024 ] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open lan- guage models.arXiv preprint arXiv:2402.03300,

  25. [25]

    Success stories on information and communication technologies for agriculture and rural development,

    [Sylvester and others, 2015] Gerard Sylvester et al. Success stories on information and communication technologies for agriculture and rural development,

  26. [26]

    Gemma 3 Technical Report

    [Teamet al., 2025 ] Gemma Team, Aishwarya Kamath, Jo- han Ferret, Shreya Pathak, Nino Vieillard, Ramona Mer- hej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram´e, Morgane Rivi`ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

  27. [27]

    Watersheds in digital spaces: an efficient algorithm based on immersion simulations.IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598,

    [Vincent and Soille, 1991] Luc Vincent and Pierre Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations.IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598,

  28. [28]

    Deep learning on edge: Extracting field boundaries from satellite images with a convolu- tional neural network.Remote sensing of environment, 245:111741,

    [Waldner and Diakogiannis, 2020] Franc ¸ois Waldner and Foivos I Diakogiannis. Deep learning on edge: Extracting field boundaries from satellite images with a convolu- tional neural network.Remote sensing of environment, 245:111741,

  29. [29]

    Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite im- ages.Remote sensing, 13(11):2197,

    [Waldneret al., 2021 ] Franc ¸ois Waldner, Foivos I Diako- giannis, Kathryn Batchelor, Michael Ciccotosto-Camp, Elizabeth Cooper-Williams, Chris Herrmann, Gonzalo Mata, and Andrew Toovey. Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite im- ages.Remote sensing, 13(11):2197,

  30. [30]

    [Wanget al., 2022 ] Sherrie Wang, Franc ¸ois Waldner, and David B. Lobell. Unlocking large-scale crop field delin- eation in smallholder farming systems with transfer learn- ing and weak supervision.Remote Sensing, 14(22),

  31. [31]

    Visionllm: Large language model is also an open-ended decoder for vision- centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513,

    [Wanget al., 2023 ] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision- centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513,

  32. [32]

    Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks.Advances in Neural Information Processing Systems, 37:69925–69975,

    [Wuet al., 2024 ] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks.Advances in Neural Information Processing Systems, 37:69925–69975,

  33. [33]

    Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geoscience and Remote Sensing,

    [Wuet al., 2025 ] Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geoscience and Remote Sensing,

  34. [34]

    Gsva: Gener- alized segmentation via multimodal large language mod- els

    [Xiaet al., 2024 ] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Gener- alized segmentation via multimodal large language mod- els. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, June

  35. [35]

    Seg- former: Simple and efficient design for semantic segmen- tation with transformers.Advances in neural information processing systems, 34:12077–12090,

    [Xieet al., 2021 ] Enze Xie, Wenhai Wang, Zhiding Yu, An- ima Anandkumar, Jose M Alvarez, and Ping Luo. Seg- former: Simple and efficient design for semantic segmen- tation with transformers.Advances in neural information processing systems, 34:12077–12090,

  36. [36]

    [Yanget al., 2022 ] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H.S. Torr. Lavt: Language-aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 18155–18165, June

  37. [37]

    A com- prehensive review of agricultural parcel and boundary de- lineation from remote sensing images: Recent progress and future perspectives.arXiv preprint arXiv:2508.14558,

    [Zhenget al., 2025 ] Juepeng Zheng, Zi Ye, Yibin Wen, Jianxi Huang, Zhiwei Zhang, Qingmei Li, Qiong Hu, Baodong Xu, Lingyuan Zhao, and Haohuan Fu. A com- prehensive review of agricultural parcel and boundary de- lineation from remote sensing images: Recent progress and future perspectives.arXiv preprint arXiv:2508.14558,

  38. [38]

    Image segmentation in foundation model era: A survey.arXiv preprint arXiv:2408.12957,

    [Zhouet al., 2024 ] Tianfei Zhou, Wang Xia, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, and Daniel Cremers. Image segmentation in foundation model era: A survey.arXiv preprint arXiv:2408.12957,

  39. [39]

    Deep learning in remote sensing: A compre- hensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36,

    [Zhuet al., 2017 ] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A compre- hensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36,

  40. [40]

    C Evaluation Metrics Following [Duaet al., 2024 ], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance

    The refined predictions are obtained by refining the coarse predictions using SAM ViT-H backbone. C Evaluation Metrics Following [Duaet al., 2024 ], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance. A particular predicted instance is con- sidered to match with a ground truth instance if they belo...