MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models

Aishwarya Jayagopal; Alok Talekar; Depanshu Sani; Piyush Tiwary; Sagar Gubbi; Subhashini Venugopalan; Utkarsh Ahuja; Vaibhav Rajan

arxiv: 2605.16179 · v1 · pith:ZZOAUXJVnew · submitted 2026-05-15 · 💻 cs.CV

MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models

Piyush Tiwary , Utkarsh Ahuja , Depanshu Sani , Aishwarya Jayagopal , Sagar Gubbi , Subhashini Venugopalan , Alok Talekar , Vaibhav Rajan This is my paper

Pith reviewed 2026-05-20 18:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords satellite image segmentationmultimodal large language modelssmallholder agricultureagricultural landscape mappinginstruction tuningdecoder-free segmentationhigh-resolution imagery

0 comments

The pith

A new instruction format lets standard multimodal models segment fragmented smallholder farms in satellite images without extra decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that multimodal large language models can handle segmentation of complex agricultural landscapes in high-resolution satellite imagery using only text-based fine-tuning. It addresses the problems of fragmented plots, high intra-class variation, and scarce labels by introducing a data format that lets the model see the full image context but generate output tokens for just one patch at a time. This design removes the need for auxiliary vision decoders and sidesteps context-length limits while still producing accurate maps. If the method works as described, it turns existing multimodal models into practical tools for mapping smallholder agriculture across data-poor regions.

Core claim

MAgSeg demonstrates that standard multimodal large language models, when fine-tuned with a novel instruction tuning data format, can segment smallholder agricultural landscapes in high-resolution satellite imagery without auxiliary vision decoders by learning global image context while producing text tokens only for a local patch.

What carries the argument

The novel instruction tuning data format that supplies global image context but restricts token generation to one local patch per output.

If this is right

Standard multimodal models can now perform segmentation on high-resolution imagery without added vision components.
The approach scales fine-tuning to larger images by avoiding full-context token generation.
Evaluations across three countries show consistent gains over existing MLLM segmentation methods.
The method supplies a practical route to mapping fragmented smallholder environments with limited labeled data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same patch-wise output trick could be tested on other remote-sensing tasks such as building detection or land-cover change.
If the format works for agriculture, it may reduce reliance on specialized decoder architectures in other fragmented-object domains.
One could check whether the method maintains performance when applied to multi-date image stacks rather than single scenes.

Load-bearing premise

The new instruction tuning format lets the model absorb full-image context while outputting tokens for only a local patch without any drop in segmentation accuracy.

What would settle it

Run MAgSeg on the same high-resolution satellite datasets used in the paper and check whether its segmentation accuracy on smallholder plots falls below that of decoder-equipped MLLM baselines.

Figures

Figures reproduced from arXiv: 2605.16179 by Aishwarya Jayagopal, Alok Talekar, Depanshu Sani, Piyush Tiwary, Sagar Gubbi, Subhashini Venugopalan, Utkarsh Ahuja, Vaibhav Rajan.

**Figure 1.** Figure 1: Overview of MAgSeg. Data Preparation: from each high-resolution satellite image xi and its segmentation map si, multiple patches pi and their corresponding masks ri are extracted, the masks are converted to a text-based RRLE representation ti to form the instruction tuning dataset: {Itext, xi, pi} → ti. Training: consists of two stages: (1) LoRA Supervised Finetuning (SFT), where the base multimodal LLM is… view at source ↗

**Figure 2.** Figure 2: Qualitative results comparing our approach MAgSeg with SOTA baselines. GT: Ground Truth. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Size-stratified performance analysis on the ALU dataset. [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Climatic Region stratified performance analysis on the [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Ecological Region stratified performance analysis on the [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison of MAgSeg’s segmentation performance with and without GRPO post-training. Region of interests are [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative results on India data. We compare our approach, MAgSeg against SOTA segmentation baselines. GT: Ground Truth. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative results on Cambodia data. We compare our approach, MAgSeg against SOTA segmentation baselines. GT: Ground [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative results on Vietnam data. We compare our approach, MAgSeg, against SOTA segmentation baselines. GT: Ground [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Visualization of erroneous segmentation by MAgSeg. GT: Ground Truth. Region of interests are emphasized in red colored boxes. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

read the original abstract

Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAgSeg claims a decoder-free MLLM route to segmenting fragmented smallholder farms in high-res satellite images via a new patch-wise instruction format, but the abstract gives no numbers or prompt details so the gains stay unverified.

read the letter

The main point is that this paper tries to adapt existing multimodal LLMs for segmenting high-resolution satellite images of smallholder agriculture without adding vision decoders. It uses a new instruction tuning format that feeds the full image for context but restricts text token output to one local patch at a time, aiming to dodge context length limits while handling fragmented plots and high intra-class variance in data-poor regions like the Global South.

Referee Report

3 major / 2 minor

Summary. The paper presents MAgSeg, a decoder-free segmentation method that adapts standard Multimodal Large Language Models to high-resolution satellite imagery for mapping fragmented smallholder agricultural landscapes in the Global South. It introduces a novel instruction-tuning data format that purportedly allows the model to internalize global image context while restricting text-token generation to a single local patch, thereby avoiding context-length bottlenecks and eliminating the need for auxiliary vision decoders. The central claim is that this architectural and data-format change yields significant performance gains over existing MLLM baselines on datasets spanning three countries.

Significance. If the core assumption holds, the work would be significant for computer vision and remote-sensing applications: it offers a scalable, decoder-free route to leverage existing MLLMs on high-resolution imagery without custom vision heads, potentially lowering the barrier for accurate mapping of complex, data-scarce agricultural environments. The emphasis on Global South smallholder landscapes also addresses an under-served domain.

major comments (3)

[§3.2] §3.2 (Novel Instruction Tuning Data Format): The description of how global context is encoded in the prompt while restricting token generation to a local patch remains high-level; no concrete prompt templates, patch-sampling strategy, or mechanism for injecting global cues (e.g., downsampled overview tokens) are provided. Without these details it is impossible to assess whether the format truly preserves segmentation accuracy on fragmented plots with high intra-class variance.
[§4] §4 (Experiments): The abstract and evaluation summary claim significant outperformance on three-country datasets, yet the manuscript supplies no quantitative metrics, error bars, ablation studies on the data format, or implementation details for the MLLM baselines. This absence prevents verification of the central claim that the new format delivers global context without accuracy loss.
[§4.3] §4.3 (Ablation or Component Analysis): No ablation isolating the contribution of the novel data format versus standard instruction tuning is reported. Such an experiment is load-bearing for the claim that the format is the key enabler of decoder-free performance.

minor comments (2)

[Figure 2] Figure 2 (architecture diagram) would benefit from explicit annotation of the global-context injection path and the local-patch token generation boundary.
[§2] The related-work section should include a brief comparison to recent decoder-free MLLM segmentation methods outside the agricultural domain to clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript describing MAgSeg. The comments highlight important areas where additional clarity and evidence would strengthen the presentation. We address each major comment point by point below and outline the revisions we will make to the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Novel Instruction Tuning Data Format): The description of how global context is encoded in the prompt while restricting token generation to a local patch remains high-level; no concrete prompt templates, patch-sampling strategy, or mechanism for injecting global cues (e.g., downsampled overview tokens) are provided. Without these details it is impossible to assess whether the format truly preserves segmentation accuracy on fragmented plots with high intra-class variance.

Authors: We agree that the current description in §3.2 is high-level and would benefit from greater specificity. In the revised manuscript we will add concrete prompt templates, a detailed description of the patch-sampling strategy, and an explicit account of how global cues are injected (including the use of downsampled overview tokens). These additions will allow readers to evaluate whether the format maintains segmentation accuracy on fragmented plots with high intra-class variance. revision: yes
Referee: [§4] §4 (Experiments): The abstract and evaluation summary claim significant outperformance on three-country datasets, yet the manuscript supplies no quantitative metrics, error bars, ablation studies on the data format, or implementation details for the MLLM baselines. This absence prevents verification of the central claim that the new format delivers global context without accuracy loss.

Authors: We acknowledge that the current version of §4 does not provide sufficient quantitative detail to fully verify the performance claims. We will expand this section to report specific quantitative metrics with error bars, implementation details for all MLLM baselines, and additional ablation results on the data format. These changes will substantiate the reported outperformance across the three-country datasets. revision: yes
Referee: [§4.3] §4.3 (Ablation or Component Analysis): No ablation isolating the contribution of the novel data format versus standard instruction tuning is reported. Such an experiment is load-bearing for the claim that the format is the key enabler of decoder-free performance.

Authors: We recognize that an ablation isolating the novel instruction-tuning format is essential to support the central claim. We will add a dedicated ablation study (or expand §4.3) that directly compares the proposed data format against standard instruction tuning, thereby demonstrating its specific contribution to decoder-free performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained architectural innovation

full rationale

The paper introduces MAgSeg as a decoder-free MLLM approach relying on a novel instruction tuning data format to handle global context with local patch token generation. This is presented as an empirical architectural and data-format contribution evaluated on multi-country datasets, without any equations, fitted parameters, or derivations that reduce to prior outputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the central claims rest on reported outperformance rather than re-expression of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions of multimodal model fine-tuning and the premise that the new data format preserves global context without explicit architectural changes.

axioms (1)

domain assumption Standard MLLM architectures can be instruction-tuned to output segmentation masks via text tokens when given appropriately formatted prompts.
Invoked in the description of the novel instruction tuning data format.

pith-pipeline@v0.9.0 · 5760 in / 1204 out tokens · 38214 ms · 2026-05-20T18:47:00.651298+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GRPO... mean-DICE score as a direct reward signal

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

[1]

Subobject-level image tokenization

[Chenet al., 2025 ] Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. InInternational Con- ference on Machine Learning,

work page 2025
[2]

Agricultural land- scape understanding at country-scale.arXiv,

[Duaet al., 2024 ] Radhika Dua, Nikita Saxena, Aditi Agar- wal, Alex Wilson, Gaurav Singh, Hoang Tran, Ishan Deshpande, Amandeep Kaur, Gaurav Aggarwal, Chandan Nath, Arnab Basu, Vishal Batchu, Sharath Holla, Bindiya Kurle, Olana Missura, Rahul Aggarwal, Shubhika Garg, Nishi Shah, Avneet Singh, Dinesh Tewari, Agata Dondzik, Bharat Adsul, Milind Sohoni, Asi...

work page 2024
[3]

[FAOet al., 2023] FAO, IFAD, UNICEF, WFP, and WHO

https: //arxiv.org/abs/2411.05359. [FAOet al., 2023] FAO, IFAD, UNICEF, WFP, and WHO. The state of food security and nutrition in the world

work page internal anchor Pith review arXiv 2023
[4]

org/10.4060/cc3017en

https://doi. org/10.4060/cc3017en. [Huet al., 2022 ] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

work page doi:10.4060/cc3017en 2022
[5]

Kang and M.¨Ozdo˘gan

[Kang and ¨Ozdo˘gan, 2019] Y . Kang and M.¨Ozdo˘gan. Field- level crop yield mapping with landsat using a hierarchical data assimilation approach.Remote Sensing of Environ- ment, 228:144–163,

work page 2019
[6]

Multi-region transfer learning for segmen- tation of crop field boundaries in satellite images with lim- ited labels

[Kerneret al., 2023 ] Hannah Kerner, Saketh Sundar, and Mathan Satish. Multi-region transfer learning for segmen- tation of crop field boundaries in satellite images with lim- ited labels. InProceedings of the AAAI Workshop on AI to Accelerate Science and Engineering,

work page 2023
[7]

Segment anything

[Kirillovet al., 2023 ] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InPro- ceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026,

work page 2023
[8]

Lisa: Reason- ing segmentation via large language model

[Laiet al., 2024 ] Xin Lai, Zhuotao Tian, Yukang Chen, Yan- wei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reason- ing segmentation via large language model. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589,

work page 2024
[9]

Text4seg: Reimagining image segmentation as text generation

[Lanet al., 2025 ] Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth Inter- national Conference on Learning Representations,

work page 2025
[10]

Lesiv, J.C

[Lesivet al., 2019 ] M. Lesiv, J.C. Laso Bayas, L. See, M. Duerauer, D. Dahlia, N. Durando, R. Hazarika, P. Ku- mar Sahariah, M. Vakolyuk, and V . Blyshchyk. Estimating the global distribution of field size using crowdsourcing. Global Change Biology, 25:174–186,

work page 2019
[11]

Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66,

[Liet al., 2024 ] Xiang Li, Congcong Wen, Yuan Hu, Zheng- hang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66,

work page 2024
[12]

Masoud, C

[Masoudet al., 2020 ] K.M. Masoud, C. Persello, and V .A. Tolpekin. Delineation of agricultural field boundaries from sentinel-2 images using a novel super-resolution contour detector based on fully convolutional networks.Remote Sensing, 12(1):59,

work page 2020
[13]

Using deep learning and very- high-resolution imagery to map smallholder field bound- aries.Remote Sensing, 14(13):3046,

[Meiet al., 2022 ] Weiye Mei, Haoyu Wang, David Fouhey, Weiqi Zhou, Isabella Hinks, Josh M Gray, Derek Van Berkel, and Meha Jain. Using deep learning and very- high-resolution imagery to map smallholder field bound- aries.Remote Sensing, 14(13):3046,

work page 2022
[14]

Categorisa- tion of farmers

[Ministry of Agriculture and Farmers Welfare, 2024] Ministry of Agriculture and Farmers Welfare. Categorisa- tion of farmers. https://www.pib.gov.in/PressReleasePage. aspx?PRID=2085181,

work page 2024
[15]

OECD Publishing, Paris,

[OECD, 2023] OECD.Agricultural Policy Monitoring and Evaluation 2023: Adapting Agriculture to Climate Change. OECD Publishing, Paris,

work page 2023
[16]

[Perselloet al., 2023 ] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and An- drew Nelson

https://doi.org/ 10.1787/b14de474-en. [Perselloet al., 2023 ] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and An- drew Nelson. Ai4smallfarms: A dataset for crop field de- lineation in southeast asian smallholder farms.IEEE Geo- science and Remote Sensing Letters, 20:1–5,

work page doi:10.1787/b14de474-en 2023
[17]

[Quenumet al., 2025 ] Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M. Chan. LISAt: Language-instructed segmenta- tion assistant for satellite imagery. InThe Thirty-ninth An- nual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track,

work page 2025
[18]

Rada and K.O

[Rada and Fuglie, 2019] N.E. Rada and K.O. Fuglie. New perspectives on farm size and productivity.Food Policy, 84:147–152,

work page 2019
[19]

Learning transferable visual models from nat- ural language supervision

[Radfordet al., 2021 ] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from nat- ural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

work page 2021
[20]

Glamm: Pixel grounding large multimodal model

[Rasheedet al., 2024 ] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming- Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018,

work page 2024
[21]

Mission critical–satellite data is a distinct modality in machine learning

[Rolfet al., 2024 ] Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical–satellite data is a distinct modality in machine learning. InInterna- tional Conference on Learning Representations,

work page 2024
[22]

Agricultural intensifica- tion and changes in cultivated areas, 1970–2005.Proceed- ings of the national academy of sciences, 106(49):20675– 20680,

[Rudelet al., 2009 ] Thomas K Rudel, Laura Schneider, Maria Uriarte, Billie Lee Turner, Ruth DeFries, Deborah Lawrence, Jacqueline Geoghegan, Susanna Hecht, Amy Ickowitz, Eric F Lambin, et al. Agricultural intensifica- tion and changes in cultivated areas, 1970–2005.Proceed- ings of the national academy of sciences, 106(49):20675– 20680,

work page 2009
[23]

Samberg, J.S

[Samberget al., 2016 ] L.H. Samberg, J.S. Gerber, N. Ra- mankutty, M. Herrero, and P.C. West. Subnational distri- bution of average farm size and smallholder contributions to global food production.Environmental Research Let- ters, 11(12):124010,

work page 2016
[24]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

[Shaoet al., 2024 ] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open lan- guage models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Success stories on information and communication technologies for agriculture and rural development,

[Sylvester and others, 2015] Gerard Sylvester et al. Success stories on information and communication technologies for agriculture and rural development,

work page 2015
[26]

Gemma 3 Technical Report

[Teamet al., 2025 ] Gemma Team, Aishwarya Kamath, Jo- han Ferret, Shreya Pathak, Nino Vieillard, Ramona Mer- hej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram´e, Morgane Rivi`ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Watersheds in digital spaces: an efficient algorithm based on immersion simulations.IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598,

[Vincent and Soille, 1991] Luc Vincent and Pierre Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations.IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598,

work page 1991
[28]

Deep learning on edge: Extracting field boundaries from satellite images with a convolu- tional neural network.Remote sensing of environment, 245:111741,

[Waldner and Diakogiannis, 2020] Franc ¸ois Waldner and Foivos I Diakogiannis. Deep learning on edge: Extracting field boundaries from satellite images with a convolu- tional neural network.Remote sensing of environment, 245:111741,

work page 2020
[29]

Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite im- ages.Remote sensing, 13(11):2197,

[Waldneret al., 2021 ] Franc ¸ois Waldner, Foivos I Diako- giannis, Kathryn Batchelor, Michael Ciccotosto-Camp, Elizabeth Cooper-Williams, Chris Herrmann, Gonzalo Mata, and Andrew Toovey. Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite im- ages.Remote sensing, 13(11):2197,

work page 2021
[30]

[Wanget al., 2022 ] Sherrie Wang, Franc ¸ois Waldner, and David B. Lobell. Unlocking large-scale crop field delin- eation in smallholder farming systems with transfer learn- ing and weak supervision.Remote Sensing, 14(22),

work page 2022
[31]

Visionllm: Large language model is also an open-ended decoder for vision- centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513,

[Wanget al., 2023 ] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision- centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513,

work page 2023
[32]

Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks.Advances in Neural Information Processing Systems, 37:69925–69975,

[Wuet al., 2024 ] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks.Advances in Neural Information Processing Systems, 37:69925–69975,

work page 2024
[33]

Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geoscience and Remote Sensing,

[Wuet al., 2025 ] Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geoscience and Remote Sensing,

work page 2025
[34]

Gsva: Gener- alized segmentation via multimodal large language mod- els

[Xiaet al., 2024 ] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Gener- alized segmentation via multimodal large language mod- els. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, June

work page 2024
[35]

Seg- former: Simple and efficient design for semantic segmen- tation with transformers.Advances in neural information processing systems, 34:12077–12090,

[Xieet al., 2021 ] Enze Xie, Wenhai Wang, Zhiding Yu, An- ima Anandkumar, Jose M Alvarez, and Ping Luo. Seg- former: Simple and efficient design for semantic segmen- tation with transformers.Advances in neural information processing systems, 34:12077–12090,

work page 2021
[36]

[Yanget al., 2022 ] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H.S. Torr. Lavt: Language-aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 18155–18165, June

work page 2022
[37]

A com- prehensive review of agricultural parcel and boundary de- lineation from remote sensing images: Recent progress and future perspectives.arXiv preprint arXiv:2508.14558,

[Zhenget al., 2025 ] Juepeng Zheng, Zi Ye, Yibin Wen, Jianxi Huang, Zhiwei Zhang, Qingmei Li, Qiong Hu, Baodong Xu, Lingyuan Zhao, and Haohuan Fu. A com- prehensive review of agricultural parcel and boundary de- lineation from remote sensing images: Recent progress and future perspectives.arXiv preprint arXiv:2508.14558,

work page arXiv 2025
[38]

Image segmentation in foundation model era: A survey.arXiv preprint arXiv:2408.12957,

[Zhouet al., 2024 ] Tianfei Zhou, Wang Xia, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, and Daniel Cremers. Image segmentation in foundation model era: A survey.arXiv preprint arXiv:2408.12957,

work page arXiv 2024
[39]

Deep learning in remote sensing: A compre- hensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36,

[Zhuet al., 2017 ] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A compre- hensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36,

work page 2017
[40]

C Evaluation Metrics Following [Duaet al., 2024 ], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance

The refined predictions are obtained by refining the coarse predictions using SAM ViT-H backbone. C Evaluation Metrics Following [Duaet al., 2024 ], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance. A particular predicted instance is con- sidered to match with a ground truth instance if they belo...

work page 2024

[1] [1]

Subobject-level image tokenization

[Chenet al., 2025 ] Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. InInternational Con- ference on Machine Learning,

work page 2025

[2] [2]

Agricultural land- scape understanding at country-scale.arXiv,

[Duaet al., 2024 ] Radhika Dua, Nikita Saxena, Aditi Agar- wal, Alex Wilson, Gaurav Singh, Hoang Tran, Ishan Deshpande, Amandeep Kaur, Gaurav Aggarwal, Chandan Nath, Arnab Basu, Vishal Batchu, Sharath Holla, Bindiya Kurle, Olana Missura, Rahul Aggarwal, Shubhika Garg, Nishi Shah, Avneet Singh, Dinesh Tewari, Agata Dondzik, Bharat Adsul, Milind Sohoni, Asi...

work page 2024

[3] [3]

[FAOet al., 2023] FAO, IFAD, UNICEF, WFP, and WHO

https: //arxiv.org/abs/2411.05359. [FAOet al., 2023] FAO, IFAD, UNICEF, WFP, and WHO. The state of food security and nutrition in the world

work page internal anchor Pith review arXiv 2023

[4] [4]

org/10.4060/cc3017en

https://doi. org/10.4060/cc3017en. [Huet al., 2022 ] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3,

work page doi:10.4060/cc3017en 2022

[5] [5]

Kang and M.¨Ozdo˘gan

[Kang and ¨Ozdo˘gan, 2019] Y . Kang and M.¨Ozdo˘gan. Field- level crop yield mapping with landsat using a hierarchical data assimilation approach.Remote Sensing of Environ- ment, 228:144–163,

work page 2019

[6] [6]

Multi-region transfer learning for segmen- tation of crop field boundaries in satellite images with lim- ited labels

[Kerneret al., 2023 ] Hannah Kerner, Saketh Sundar, and Mathan Satish. Multi-region transfer learning for segmen- tation of crop field boundaries in satellite images with lim- ited labels. InProceedings of the AAAI Workshop on AI to Accelerate Science and Engineering,

work page 2023

[7] [7]

Segment anything

[Kirillovet al., 2023 ] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InPro- ceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026,

work page 2023

[8] [8]

Lisa: Reason- ing segmentation via large language model

[Laiet al., 2024 ] Xin Lai, Zhuotao Tian, Yukang Chen, Yan- wei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reason- ing segmentation via large language model. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589,

work page 2024

[9] [9]

Text4seg: Reimagining image segmentation as text generation

[Lanet al., 2025 ] Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Text4seg: Reimagining image segmentation as text generation. InThe Thirteenth Inter- national Conference on Learning Representations,

work page 2025

[10] [10]

Lesiv, J.C

[Lesivet al., 2019 ] M. Lesiv, J.C. Laso Bayas, L. See, M. Duerauer, D. Dahlia, N. Durando, R. Hazarika, P. Ku- mar Sahariah, M. Vakolyuk, and V . Blyshchyk. Estimating the global distribution of field size using crowdsourcing. Global Change Biology, 25:174–186,

work page 2019

[11] [11]

Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66,

[Liet al., 2024 ] Xiang Li, Congcong Wen, Yuan Hu, Zheng- hang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66,

work page 2024

[12] [12]

Masoud, C

[Masoudet al., 2020 ] K.M. Masoud, C. Persello, and V .A. Tolpekin. Delineation of agricultural field boundaries from sentinel-2 images using a novel super-resolution contour detector based on fully convolutional networks.Remote Sensing, 12(1):59,

work page 2020

[13] [13]

Using deep learning and very- high-resolution imagery to map smallholder field bound- aries.Remote Sensing, 14(13):3046,

[Meiet al., 2022 ] Weiye Mei, Haoyu Wang, David Fouhey, Weiqi Zhou, Isabella Hinks, Josh M Gray, Derek Van Berkel, and Meha Jain. Using deep learning and very- high-resolution imagery to map smallholder field bound- aries.Remote Sensing, 14(13):3046,

work page 2022

[14] [14]

Categorisa- tion of farmers

[Ministry of Agriculture and Farmers Welfare, 2024] Ministry of Agriculture and Farmers Welfare. Categorisa- tion of farmers. https://www.pib.gov.in/PressReleasePage. aspx?PRID=2085181,

work page 2024

[15] [15]

OECD Publishing, Paris,

[OECD, 2023] OECD.Agricultural Policy Monitoring and Evaluation 2023: Adapting Agriculture to Climate Change. OECD Publishing, Paris,

work page 2023

[16] [16]

[Perselloet al., 2023 ] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and An- drew Nelson

https://doi.org/ 10.1787/b14de474-en. [Perselloet al., 2023 ] Claudio Persello, Jeroen Grift, Xinyan Fan, Claudia Paris, Ronny H ¨ansch, Mila Koeva, and An- drew Nelson. Ai4smallfarms: A dataset for crop field de- lineation in southeast asian smallholder farms.IEEE Geo- science and Remote Sensing Letters, 20:1–5,

work page doi:10.1787/b14de474-en 2023

[17] [17]

[Quenumet al., 2025 ] Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M. Chan. LISAt: Language-instructed segmenta- tion assistant for satellite imagery. InThe Thirty-ninth An- nual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track,

work page 2025

[18] [18]

Rada and K.O

[Rada and Fuglie, 2019] N.E. Rada and K.O. Fuglie. New perspectives on farm size and productivity.Food Policy, 84:147–152,

work page 2019

[19] [19]

Learning transferable visual models from nat- ural language supervision

[Radfordet al., 2021 ] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from nat- ural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

work page 2021

[20] [20]

Glamm: Pixel grounding large multimodal model

[Rasheedet al., 2024 ] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming- Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018,

work page 2024

[21] [21]

Mission critical–satellite data is a distinct modality in machine learning

[Rolfet al., 2024 ] Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Mission critical–satellite data is a distinct modality in machine learning. InInterna- tional Conference on Learning Representations,

work page 2024

[22] [22]

Agricultural intensifica- tion and changes in cultivated areas, 1970–2005.Proceed- ings of the national academy of sciences, 106(49):20675– 20680,

[Rudelet al., 2009 ] Thomas K Rudel, Laura Schneider, Maria Uriarte, Billie Lee Turner, Ruth DeFries, Deborah Lawrence, Jacqueline Geoghegan, Susanna Hecht, Amy Ickowitz, Eric F Lambin, et al. Agricultural intensifica- tion and changes in cultivated areas, 1970–2005.Proceed- ings of the national academy of sciences, 106(49):20675– 20680,

work page 2009

[23] [23]

Samberg, J.S

[Samberget al., 2016 ] L.H. Samberg, J.S. Gerber, N. Ra- mankutty, M. Herrero, and P.C. West. Subnational distri- bution of average farm size and smallholder contributions to global food production.Environmental Research Let- ters, 11(12):124010,

work page 2016

[24] [24]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

[Shaoet al., 2024 ] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open lan- guage models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Success stories on information and communication technologies for agriculture and rural development,

[Sylvester and others, 2015] Gerard Sylvester et al. Success stories on information and communication technologies for agriculture and rural development,

work page 2015

[26] [26]

Gemma 3 Technical Report

[Teamet al., 2025 ] Gemma Team, Aishwarya Kamath, Jo- han Ferret, Shreya Pathak, Nino Vieillard, Ramona Mer- hej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram´e, Morgane Rivi`ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Watersheds in digital spaces: an efficient algorithm based on immersion simulations.IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598,

[Vincent and Soille, 1991] Luc Vincent and Pierre Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations.IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(06):583–598,

work page 1991

[28] [28]

Deep learning on edge: Extracting field boundaries from satellite images with a convolu- tional neural network.Remote sensing of environment, 245:111741,

[Waldner and Diakogiannis, 2020] Franc ¸ois Waldner and Foivos I Diakogiannis. Deep learning on edge: Extracting field boundaries from satellite images with a convolu- tional neural network.Remote sensing of environment, 245:111741,

work page 2020

[29] [29]

Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite im- ages.Remote sensing, 13(11):2197,

[Waldneret al., 2021 ] Franc ¸ois Waldner, Foivos I Diako- giannis, Kathryn Batchelor, Michael Ciccotosto-Camp, Elizabeth Cooper-Williams, Chris Herrmann, Gonzalo Mata, and Andrew Toovey. Detect, consolidate, delineate: Scalable mapping of field boundaries using satellite im- ages.Remote sensing, 13(11):2197,

work page 2021

[30] [30]

[Wanget al., 2022 ] Sherrie Wang, Franc ¸ois Waldner, and David B. Lobell. Unlocking large-scale crop field delin- eation in smallholder farming systems with transfer learn- ing and weak supervision.Remote Sensing, 14(22),

work page 2022

[31] [31]

Visionllm: Large language model is also an open-ended decoder for vision- centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513,

[Wanget al., 2023 ] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open-ended decoder for vision- centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513,

work page 2023

[32] [32]

Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks.Advances in Neural Information Processing Systems, 37:69925–69975,

[Wuet al., 2024 ] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks.Advances in Neural Information Processing Systems, 37:69925–69975,

work page 2024

[33] [33]

Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geoscience and Remote Sensing,

[Wuet al., 2025 ] Haiyang Wu, Zhuofei Du, Dandan Zhong, Yuze Wang, and Chao Tao. Fsvlm: A vision-language model for remote sensing farmland segmentation.IEEE Transactions on Geoscience and Remote Sensing,

work page 2025

[34] [34]

Gsva: Gener- alized segmentation via multimodal large language mod- els

[Xiaet al., 2024 ] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Gener- alized segmentation via multimodal large language mod- els. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, June

work page 2024

[35] [35]

Seg- former: Simple and efficient design for semantic segmen- tation with transformers.Advances in neural information processing systems, 34:12077–12090,

[Xieet al., 2021 ] Enze Xie, Wenhai Wang, Zhiding Yu, An- ima Anandkumar, Jose M Alvarez, and Ping Luo. Seg- former: Simple and efficient design for semantic segmen- tation with transformers.Advances in neural information processing systems, 34:12077–12090,

work page 2021

[36] [36]

[Yanget al., 2022 ] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H.S. Torr. Lavt: Language-aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 18155–18165, June

work page 2022

[37] [37]

A com- prehensive review of agricultural parcel and boundary de- lineation from remote sensing images: Recent progress and future perspectives.arXiv preprint arXiv:2508.14558,

[Zhenget al., 2025 ] Juepeng Zheng, Zi Ye, Yibin Wen, Jianxi Huang, Zhiwei Zhang, Qingmei Li, Qiong Hu, Baodong Xu, Lingyuan Zhao, and Haohuan Fu. A com- prehensive review of agricultural parcel and boundary de- lineation from remote sensing images: Recent progress and future perspectives.arXiv preprint arXiv:2508.14558,

work page arXiv 2025

[38] [38]

Image segmentation in foundation model era: A survey.arXiv preprint arXiv:2408.12957,

[Zhouet al., 2024 ] Tianfei Zhou, Wang Xia, Fei Zhang, Boyu Chang, Wenguan Wang, Ye Yuan, Ender Konukoglu, and Daniel Cremers. Image segmentation in foundation model era: A survey.arXiv preprint arXiv:2408.12957,

work page arXiv 2024

[39] [39]

Deep learning in remote sensing: A compre- hensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36,

[Zhuet al., 2017 ] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. Deep learning in remote sensing: A compre- hensive review and list of resources.IEEE geoscience and remote sensing magazine, 5(4):8–36,

work page 2017

[40] [40]

C Evaluation Metrics Following [Duaet al., 2024 ], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance

The refined predictions are obtained by refining the coarse predictions using SAM ViT-H backbone. C Evaluation Metrics Following [Duaet al., 2024 ], for instance-wise metrics, we merge and match multiple predictions that overlap with a ground truth instance. A particular predicted instance is con- sidered to match with a ground truth instance if they belo...

work page 2024