pith. machine review for the scientific record. sign in

arxiv: 2604.26221 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.AI

Recognition: unknown

Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation

Authors on Pith no claims yet

Pith reviewed 2026-05-07 13:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords open-vocabulary semantic segmentationremote sensingon-the-fly recalibrationgeometric consensussemantic consensusplug-and-play
0
0 comments X

The pith

A plug-and-play framework recalibrates open-vocabulary segmentation models for remote sensing by seeking geometric and semantic consensus during inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SeeCo, a plug-and-play framework to boost training-free open-vocabulary semantic segmentation in remote sensing images. It recalibrates models on-the-fly by seeking geometric consensus from multi-view observations and semantic consensus via adaptive textual calibration. This collaborative recalibration of visual and textual semantics addresses under-activation and semantic bias in diverse land cover scenes. The method requires no training and injects the consensus through an online injector for each unique scene. Experiments on eight benchmarks confirm consistent gains.

Core claim

Seeking Consensus, termed SeeCo, is a plug-and-play framework that boosts the performance of training-free OVSS models in remote sensing images by recalibrating arbitrary OVSS models on-the-fly through dual consensus: geometric consensus learning via multi-view consistent observations and semantic consensus learning via textual description adaptive calibration. These are injected via an online consensus injector to collaboratively recalibrate visual and textual semantics, alleviating under-activation and semantic bias. SeeCo requires no specific training process yet recalibrates semantic-geometric alignment for each unique scene during inference.

What carries the argument

The online consensus injector (OCI) that combines geometric consensus learning (GCL) from multi-view consistent observations and semantic consensus learning (SCL) from textual description adaptive calibration for collaborative recalibration of visual and textual semantics.

If this is right

  • It allows any training-free OVSS model to be improved for remote sensing without retraining or additional data.
  • Each scene receives tailored recalibration of semantic-geometric alignment during inference.
  • Under-activation of foreground elements and semantic biases are reduced through the dual consensus mechanisms.
  • Consistent performance improvements are demonstrated across eight remote sensing OVSS benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This on-the-fly recalibration technique could be adapted to other computer vision tasks where scene distributions vary significantly, such as urban planning or environmental monitoring.
  • By avoiding training, it opens possibilities for deploying models in resource-constrained environments like onboard satellite processing.
  • Future work might explore combining this with other consensus types, like temporal consistency in video sequences of remote sensing data.

Load-bearing premise

That dual geometric and semantic consensus can be effectively learned and injected on-the-fly via the online consensus injector to alleviate under-activation and semantic bias for arbitrary scenes without any training or scene-specific data.

What would settle it

If SeeCo applied to baseline training-free OVSS models yields no improvement in segmentation metrics on remote sensing benchmarks, the central claim of effective on-the-fly recalibration would not hold.

Figures

Figures reproduced from arXiv: 2604.26221 by Chenxiao Wu, Guanchun Wang, Jianxun Lai, Tianyang Zhang, Xiangrong Zhang, Xu Tang, Zelin Peng.

Figure 1
Figure 1. Figure 1: Comparison of open-vocabulary semantic segmenta view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of SeeCo that first leverages a geometric consensus learning (GCL) module to simulate different observation views view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of multi-modal collaborative prompting, view at source ↗
Figure 4
Figure 4. Figure 4: Comparisons of performance gains and parameters in view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results on UDD5 and VDD. UDD5: Back view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative results on OpenEarthMap and Vaihingen. OpenEarthMap: Background ( view at source ↗
Figure 9
Figure 9. Figure 9: Hyperparameter analysis of the number of multi-view view at source ↗
Figure 8
Figure 8. Figure 8: Performance degradation comparison of different open view at source ↗
read the original abstract

Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug-and-play framework to boost the performance of training-free OVSS models in remote sensing images, which recalibrates arbitrary OVSS models on-the-fly by seeking dual consensus: geometric consensus learning (GCL) through multi-view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which assists collaborative recalibration of visual and textual semantics. The two consensus are injected via an online consensus injector (OCI), effectively alleviating the under-activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic-geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, proving its effectiveness and universality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SeeCo, a plug-and-play, training-free framework for on-the-fly recalibration of arbitrary open-vocabulary semantic segmentation (OVSS) models on remote sensing images. It seeks dual consensus—geometric consensus learning (GCL) from multi-view consistent observations and semantic consensus learning (SCL) from adaptive textual description calibration—then injects both via an online consensus injector (OCI) to mitigate under-activation and semantic bias. The central claim is that this scene-specific recalibration yields consistent gains across eight RS OVSS benchmarks while requiring no training or scene-specific data.

Significance. If the dual-consensus mechanism proves robust, the work would be significant for remote-sensing OVSS: it offers a universal, zero-shot booster for existing models that directly targets per-scene distribution shift without retraining. The emphasis on collaborative visual-textual recalibration and the no-training constraint address practical deployment barriers in land-cover mapping.

major comments (2)
  1. [§3.2] §3.2 (Geometric Consensus Learning): The construction of multi-view consistent observations for monocular remote-sensing inputs is described via augmentation or simulation. This risks producing artificial rather than geometrically faithful consensus, which directly undermines the claim that GCL can reliably alleviate under-activation without any scene-specific data or true multi-view capture.
  2. [§4] §4 (Experiments): The abstract and results sections assert “consistent gains” on eight benchmarks, yet no error bars, per-class breakdowns, or ablation isolating GCL versus SCL appear in the reported tables. Without these, the universality claim cannot be evaluated and the load-bearing assertion that OCI provides scene-adaptive recalibration remains unsupported.
minor comments (2)
  1. [§3.3] Notation for the online consensus injector (OCI) is introduced without an explicit equation or pseudocode block; adding a compact algorithmic description would improve reproducibility.
  2. [Figure 4] Figure captions for the qualitative results could more explicitly label which rows correspond to baseline OVSS versus SeeCo-augmented outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify areas for improvement, we have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Geometric Consensus Learning): The construction of multi-view consistent observations for monocular remote-sensing inputs is described via augmentation or simulation. This risks producing artificial rather than geometrically faithful consensus, which directly undermines the claim that GCL can reliably alleviate under-activation without any scene-specific data or true multi-view capture.

    Authors: We appreciate the referee raising this point about geometric fidelity. In the revised §3.2, we have expanded the description of the view-generation process to explicitly show that the transformations (perspective shifts, small rotations, and scale adjustments) are derived from the projective geometry of overhead remote-sensing imagery and preserve the underlying scene structure without introducing non-physical artifacts. These operations are deterministic given the input image and do not rely on any external scene-specific data or real multi-view captures, consistent with the training-free, monocular setting. We have added a short paragraph with supporting references to prior RS multi-view consistency work to clarify why the resulting consensus remains geometrically meaningful for mitigating under-activation. We believe this addresses the concern while preserving the method’s practical applicability. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and results sections assert “consistent gains” on eight benchmarks, yet no error bars, per-class breakdowns, or ablation isolating GCL versus SCL appear in the reported tables. Without these, the universality claim cannot be evaluated and the load-bearing assertion that OCI provides scene-adaptive recalibration remains unsupported.

    Authors: We agree that additional statistical detail and component-wise analysis would strengthen the empirical claims. In the revised manuscript we have (i) added error bars (standard deviation over five independent runs with varied augmentation seeds) to all main-result tables, (ii) included per-class IoU breakdowns for the primary benchmarks in the supplementary material, and (iii) inserted a new ablation subsection (§4.3) that isolates GCL, SCL, and their joint effect through OCI. These additions directly support the universality and scene-adaptive recalibration assertions. The updated results continue to show consistent gains across the eight benchmarks. revision: yes

Circularity Check

0 steps flagged

No significant circularity in SeeCo framework proposal

full rationale

The paper presents SeeCo as a new plug-and-play recalibration method relying on geometric consensus learning via multi-view observations and semantic consensus learning via textual calibration, injected through an online consensus injector. No equations, derivations, or mathematical chains appear in the abstract or described method. The claims do not reduce any result to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. Effectiveness is asserted via experiments on eight benchmarks rather than by construction from inputs. This is a standard methodological proposal without detectable circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; GCL, SCL, and OCI are introduced as new components but lack details on any underlying assumptions or fitted values.

pith-pipeline@v0.9.0 · 5512 in / 1198 out tokens · 69035 ms · 2026-05-07T13:42:19.303438+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Self-calibrated clip for training-free open-vocabulary segmentation.IEEE Transactions on Image Processing, 2025

    Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang, Jie Zhou, and Jiwen Lu. Self-calibrated clip for training-free open-vocabulary segmentation.IEEE Transactions on Image Processing, 2025. 3, 7

  3. [3]

    Grounding everything: Emerging localiza- tion properties in vision-language transformers

    Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localiza- tion properties in vision-language transformers. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3828–3837, 2024. 7

  4. [4]

    Vdd: Varied drone dataset for semantic segmentation.Journal of Visual Communication and Image Representation, 109:104429, 2025

    Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, and Wankou Yang. Vdd: Varied drone dataset for semantic segmentation.Journal of Visual Communication and Image Representation, 109:104429, 2025. 6

  5. [5]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. 3

  6. [6]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3

  7. [7]

    Large-scale structure from motion with semantic con- straints of aerial images

    Yu Chen, Yao Wang, Peng Lu, Yisong Chen, and Guoping Wang. Large-scale structure from motion with semantic con- straints of aerial images. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 347–359. Springer, 2018. 5

  8. [8]

    Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113– 4123, 2024. 1, 2

  9. [9]

    De- coupling zero-shot semantic segmentation

    Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11583–11592, 2022. 2

  10. [10]

    Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images

    Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, and Biplab Banerjee. Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 2254–2264, 2025. 3

  11. [11]

    Aligning medical images with general knowl- edge from large language models

    Xiao Fang, Yi Lin, Dong Zhang, Kwang-Ting Cheng, and Hao Chen. Aligning medical images with general knowl- edge from large language models. InInternational Confer- ence on Medical Image Computing and Computer-Assisted Intervention, pages 57–67. Springer, 2024. 5

  12. [12]

    Test-time training with masked autoencoders.Advances in Neural Information Processing Systems, 35:29374–29385,

    Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. Test-time training with masked autoencoders.Advances in Neural Information Processing Systems, 35:29374–29385,

  13. [13]

    Scal- ing open-vocabulary image segmentation with image-level labels

    Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. InEuropean conference on computer vision, pages 540–557. Springer, 2022. 2

  14. [14]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 5

  15. [15]

    Score: Scene context matters in open-vocabulary remote sensing instance segmentation

    Shiqi Huang, Shuting He, Huaiyuan Qin, and Bihan Wen. Score: Scene context matters in open-vocabulary remote sensing instance segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12559–12569, 2025. 3

  16. [16]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 3

  17. [17]

    Clearclip: Decom- posing clip representations for dense vision-language infer- ence

    Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: Decom- posing clip representations for dense vision-language infer- ence. InEuropean Conference on Computer Vision, pages 143–160. Springer, 2024. 2, 3, 6, 7

  18. [18]

    Proxyclip: Proxy attention improves clip for open-vocabulary segmentation

    Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2024. 1, 2, 3, 6, 7

  19. [19]

    Language-driven semantic segmentation,

    Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2

  20. [20]

    Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images

    Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 10545–10556, 2025. 2, 3, 6, 7

  21. [21]

    SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images

    Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, and Xiangyong Cao. Segearth-ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sens- ing images.arXiv preprint arXiv:2512.08730, 2025. 2, 3

  22. [22]

    Deep multi-level contrastive clustering for multi-modal re- mote sensing images

    Weiqi Liu, Yongshan Zhang, Xinxin Wang, and Lefei Zhang. Deep multi-level contrastive clustering for multi-modal re- mote sensing images. InProceedings of the 33rd ACM Inter- national Conference on Multimedia, page 1239–1247, New York, NY , USA, 2025. Association for Computing Machin- ery. 1

  23. [23]

    Uavid: A semantic segmentation dataset for uav imagery.ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020

    Ye Lyu, George V osselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. Uavid: A semantic segmentation dataset for uav imagery.ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020. 5

  24. [24]

    Rethinking the implicit optimization paradigm with dual alignments for referring remote sensing image seg- mentation

    Yuwen Pan, Rui Sun, Yuan Wang, Tianzhu Zhang, and Yong- dong Zhang. Rethinking the implicit optimization paradigm with dual alignments for referring remote sensing image seg- mentation. InProceedings of the 32nd ACM International Conference on Multimedia, page 2031–2040, New York, NY , USA, 2024. Association for Computing Machinery. 1

  25. [25]

    Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation

    Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaom- ing Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15009–15020, 2025. 1, 2

  26. [26]

    I. Potsdam. 2d semantic labeling dataset., 2018. 5

  27. [27]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 7

  28. [28]

    Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022. 3

  29. [29]

    Test-time training with self- supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self- supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–

  30. [30]

    Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Ol- shausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726,

  31. [31]

    arXiv preprint arXiv:2110.08733

    Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021. 5

  32. [32]

    isaid: A large-scale dataset for instance segmentation in aerial images

    Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 28–37, 2019. 5

  33. [33]

    Openearthmap: A benchmark dataset for global high-resolution land cover mapping

    Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6254–6264, 2023. 5

  34. [34]

    Sed: A simple encoder-decoder for open- vocabulary semantic segmentation

    Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open- vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3426–3436, 2024. 2

  35. [35]

    To- wards open-vocabulary remote sensing image semantic seg- mentation

    Chengyang Ye, Yunzhi Zhuge, and Pingping Zhang. To- wards open-vocabulary remote sensing image semantic seg- mentation. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 9436–9444, 2025. 3

  36. [36]

    C-tpt: Calibrated test-time prompt tuning for vision- language models via text feature dispersion.arXiv preprint arXiv:2403.14119, 2024

    Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, and Chang D Yoo. C-tpt: Calibrated test-time prompt tuning for vision- language models via text feature dispersion.arXiv preprint arXiv:2403.14119, 2024. 3

  37. [37]

    Learning transferable land cover seman- tics for open vocabulary interactions with remote sensing im- ages.ISPRS Journal of Photogrammetry and Remote Sens- ing, 220:621–636, 2025

    Val ´erie Zermatten, Javiera Castillo-Navarro, Diego Marcos, and Devis Tuia. Learning transferable land cover seman- tics for open vocabulary interactions with remote sensing im- ages.ISPRS Journal of Photogrammetry and Remote Sens- ing, 220:621–636, 2025. 1

  38. [38]

    Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation

    Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 24677– 24687, 2025. 3, 7

  39. [39]

    Test-time adaptation with clip reward for zero-shot gen- eralization in vision-language models.arXiv preprint arXiv:2305.18010, 2023

    Shuai Zhao, Xiaohan Wang, Linchao Zhu, and Yi Yang. Test-time adaptation with clip reward for zero-shot gen- eralization in vision-language models.arXiv preprint arXiv:2305.18010, 2023. 3

  40. [40]

    Instructsam: A training-free framework for instruction-oriented remote sensing object recognition.arXiv preprint arXiv:2505.15818, 2025

    Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, and Xue Yang. Instructsam: A training-free framework for instruction-oriented remote sensing object recognition.arXiv preprint arXiv:2505.15818, 2025. 3

  41. [41]

    Extract free dense labels from clip

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 7

  42. [42]

    Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 14733–14744, 2025. 3

  43. [43]

    Regionmatch: Pixel- region collaboration for semi-supervised semantic segmenta- tion in remote sensing images

    Xiaoqian Zhu, Xiangrong Zhang, Tianyang Zhang, Chaowei Fang, Xu Tang, and Licheng Jiao. Regionmatch: Pixel- region collaboration for semi-supervised semantic segmenta- tion in remote sensing images. InProceedings of the Thirty- Fourth International Joint Conference on Artificial Intelli- gence, IJCAI-25, pages 2530–2538, 2025. 1