Recognition: unknown
Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
Pith reviewed 2026-05-07 13:42 UTC · model grok-4.3
The pith
A plug-and-play framework recalibrates open-vocabulary segmentation models for remote sensing by seeking geometric and semantic consensus during inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seeking Consensus, termed SeeCo, is a plug-and-play framework that boosts the performance of training-free OVSS models in remote sensing images by recalibrating arbitrary OVSS models on-the-fly through dual consensus: geometric consensus learning via multi-view consistent observations and semantic consensus learning via textual description adaptive calibration. These are injected via an online consensus injector to collaboratively recalibrate visual and textual semantics, alleviating under-activation and semantic bias. SeeCo requires no specific training process yet recalibrates semantic-geometric alignment for each unique scene during inference.
What carries the argument
The online consensus injector (OCI) that combines geometric consensus learning (GCL) from multi-view consistent observations and semantic consensus learning (SCL) from textual description adaptive calibration for collaborative recalibration of visual and textual semantics.
If this is right
- It allows any training-free OVSS model to be improved for remote sensing without retraining or additional data.
- Each scene receives tailored recalibration of semantic-geometric alignment during inference.
- Under-activation of foreground elements and semantic biases are reduced through the dual consensus mechanisms.
- Consistent performance improvements are demonstrated across eight remote sensing OVSS benchmarks.
Where Pith is reading between the lines
- This on-the-fly recalibration technique could be adapted to other computer vision tasks where scene distributions vary significantly, such as urban planning or environmental monitoring.
- By avoiding training, it opens possibilities for deploying models in resource-constrained environments like onboard satellite processing.
- Future work might explore combining this with other consensus types, like temporal consistency in video sequences of remote sensing data.
Load-bearing premise
That dual geometric and semantic consensus can be effectively learned and injected on-the-fly via the online consensus injector to alleviate under-activation and semantic bias for arbitrary scenes without any training or scene-specific data.
What would settle it
If SeeCo applied to baseline training-free OVSS models yields no improvement in segmentation metrics on remote sensing benchmarks, the central claim of effective on-the-fly recalibration would not hold.
Figures
read the original abstract
Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug-and-play framework to boost the performance of training-free OVSS models in remote sensing images, which recalibrates arbitrary OVSS models on-the-fly by seeking dual consensus: geometric consensus learning (GCL) through multi-view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which assists collaborative recalibration of visual and textual semantics. The two consensus are injected via an online consensus injector (OCI), effectively alleviating the under-activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic-geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, proving its effectiveness and universality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SeeCo, a plug-and-play, training-free framework for on-the-fly recalibration of arbitrary open-vocabulary semantic segmentation (OVSS) models on remote sensing images. It seeks dual consensus—geometric consensus learning (GCL) from multi-view consistent observations and semantic consensus learning (SCL) from adaptive textual description calibration—then injects both via an online consensus injector (OCI) to mitigate under-activation and semantic bias. The central claim is that this scene-specific recalibration yields consistent gains across eight RS OVSS benchmarks while requiring no training or scene-specific data.
Significance. If the dual-consensus mechanism proves robust, the work would be significant for remote-sensing OVSS: it offers a universal, zero-shot booster for existing models that directly targets per-scene distribution shift without retraining. The emphasis on collaborative visual-textual recalibration and the no-training constraint address practical deployment barriers in land-cover mapping.
major comments (2)
- [§3.2] §3.2 (Geometric Consensus Learning): The construction of multi-view consistent observations for monocular remote-sensing inputs is described via augmentation or simulation. This risks producing artificial rather than geometrically faithful consensus, which directly undermines the claim that GCL can reliably alleviate under-activation without any scene-specific data or true multi-view capture.
- [§4] §4 (Experiments): The abstract and results sections assert “consistent gains” on eight benchmarks, yet no error bars, per-class breakdowns, or ablation isolating GCL versus SCL appear in the reported tables. Without these, the universality claim cannot be evaluated and the load-bearing assertion that OCI provides scene-adaptive recalibration remains unsupported.
minor comments (2)
- [§3.3] Notation for the online consensus injector (OCI) is introduced without an explicit equation or pseudocode block; adding a compact algorithmic description would improve reproducibility.
- [Figure 4] Figure captions for the qualitative results could more explicitly label which rows correspond to baseline OVSS versus SeeCo-augmented outputs.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify areas for improvement, we have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Geometric Consensus Learning): The construction of multi-view consistent observations for monocular remote-sensing inputs is described via augmentation or simulation. This risks producing artificial rather than geometrically faithful consensus, which directly undermines the claim that GCL can reliably alleviate under-activation without any scene-specific data or true multi-view capture.
Authors: We appreciate the referee raising this point about geometric fidelity. In the revised §3.2, we have expanded the description of the view-generation process to explicitly show that the transformations (perspective shifts, small rotations, and scale adjustments) are derived from the projective geometry of overhead remote-sensing imagery and preserve the underlying scene structure without introducing non-physical artifacts. These operations are deterministic given the input image and do not rely on any external scene-specific data or real multi-view captures, consistent with the training-free, monocular setting. We have added a short paragraph with supporting references to prior RS multi-view consistency work to clarify why the resulting consensus remains geometrically meaningful for mitigating under-activation. We believe this addresses the concern while preserving the method’s practical applicability. revision: yes
-
Referee: [§4] §4 (Experiments): The abstract and results sections assert “consistent gains” on eight benchmarks, yet no error bars, per-class breakdowns, or ablation isolating GCL versus SCL appear in the reported tables. Without these, the universality claim cannot be evaluated and the load-bearing assertion that OCI provides scene-adaptive recalibration remains unsupported.
Authors: We agree that additional statistical detail and component-wise analysis would strengthen the empirical claims. In the revised manuscript we have (i) added error bars (standard deviation over five independent runs with varied augmentation seeds) to all main-result tables, (ii) included per-class IoU breakdowns for the primary benchmarks in the supplementary material, and (iii) inserted a new ablation subsection (§4.3) that isolates GCL, SCL, and their joint effect through OCI. These additions directly support the universality and scene-adaptive recalibration assertions. The updated results continue to show consistent gains across the eight benchmarks. revision: yes
Circularity Check
No significant circularity in SeeCo framework proposal
full rationale
The paper presents SeeCo as a new plug-and-play recalibration method relying on geometric consensus learning via multi-view observations and semantic consensus learning via textual calibration, injected through an online consensus injector. No equations, derivations, or mathematical chains appear in the abstract or described method. The claims do not reduce any result to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. Effectiveness is asserted via experiments on eight benchmarks rather than by construction from inputs. This is a standard methodological proposal without detectable circular reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review arXiv
-
[2]
Self-calibrated clip for training-free open-vocabulary segmentation.IEEE Transactions on Image Processing, 2025
Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang, Jie Zhou, and Jiwen Lu. Self-calibrated clip for training-free open-vocabulary segmentation.IEEE Transactions on Image Processing, 2025. 3, 7
2025
-
[3]
Grounding everything: Emerging localiza- tion properties in vision-language transformers
Walid Bousselham, Felix Petersen, Vittorio Ferrari, and Hilde Kuehne. Grounding everything: Emerging localiza- tion properties in vision-language transformers. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3828–3837, 2024. 7
2024
-
[4]
Vdd: Varied drone dataset for semantic segmentation.Journal of Visual Communication and Image Representation, 109:104429, 2025
Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, and Wankou Yang. Vdd: Varied drone dataset for semantic segmentation.Journal of Visual Communication and Image Representation, 109:104429, 2025. 6
2025
-
[5]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. 3
work page internal anchor Pith review arXiv 2025
-
[6]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 3
2021
-
[7]
Large-scale structure from motion with semantic con- straints of aerial images
Yu Chen, Yao Wang, Peng Lu, Yisong Chen, and Guoping Wang. Large-scale structure from motion with semantic con- straints of aerial images. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 347–359. Springer, 2018. 5
2018
-
[8]
Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation
Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113– 4123, 2024. 1, 2
2024
-
[9]
De- coupling zero-shot semantic segmentation
Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11583–11592, 2022. 2
2022
-
[10]
Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images
Saikat Dutta, Akhil Vasim, Siddhant Gole, Hamid Rezatofighi, and Biplab Banerjee. Aeroseg: Harnessing sam for open-vocabulary segmentation in remote sensing images. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 2254–2264, 2025. 3
2025
-
[11]
Aligning medical images with general knowl- edge from large language models
Xiao Fang, Yi Lin, Dong Zhang, Kwang-Ting Cheng, and Hao Chen. Aligning medical images with general knowl- edge from large language models. InInternational Confer- ence on Medical Image Computing and Computer-Assisted Intervention, pages 57–67. Springer, 2024. 5
2024
-
[12]
Test-time training with masked autoencoders.Advances in Neural Information Processing Systems, 35:29374–29385,
Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. Test-time training with masked autoencoders.Advances in Neural Information Processing Systems, 35:29374–29385,
-
[13]
Scal- ing open-vocabulary image segmentation with image-level labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scal- ing open-vocabulary image segmentation with image-level labels. InEuropean conference on computer vision, pages 540–557. Springer, 2022. 2
2022
-
[14]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 5
2022
-
[15]
Score: Scene context matters in open-vocabulary remote sensing instance segmentation
Shiqi Huang, Shuting He, Huaiyuan Qin, and Bihan Wen. Score: Scene context matters in open-vocabulary remote sensing instance segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12559–12569, 2025. 3
2025
-
[16]
Segment any- thing
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 3
2023
-
[17]
Clearclip: Decom- posing clip representations for dense vision-language infer- ence
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Clearclip: Decom- posing clip representations for dense vision-language infer- ence. InEuropean Conference on Computer Vision, pages 143–160. Springer, 2024. 2, 3, 6, 7
2024
-
[18]
Proxyclip: Proxy attention improves clip for open-vocabulary segmentation
Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. Proxyclip: Proxy attention improves clip for open-vocabulary segmentation. InEuropean Conference on Computer Vision, pages 70–88. Springer, 2024. 1, 2, 3, 6, 7
2024
-
[19]
Language-driven semantic segmentation,
Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2
-
[20]
Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images
Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, and Zhi Wang. Segearth-ov: Towards training-free open-vocabulary segmentation for remote sens- ing images. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 10545–10556, 2025. 2, 3, 6, 7
2025
-
[21]
SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, and Xiangyong Cao. Segearth-ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sens- ing images.arXiv preprint arXiv:2512.08730, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Deep multi-level contrastive clustering for multi-modal re- mote sensing images
Weiqi Liu, Yongshan Zhang, Xinxin Wang, and Lefei Zhang. Deep multi-level contrastive clustering for multi-modal re- mote sensing images. InProceedings of the 33rd ACM Inter- national Conference on Multimedia, page 1239–1247, New York, NY , USA, 2025. Association for Computing Machin- ery. 1
2025
-
[23]
Uavid: A semantic segmentation dataset for uav imagery.ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020
Ye Lyu, George V osselman, Gui-Song Xia, Alper Yilmaz, and Michael Ying Yang. Uavid: A semantic segmentation dataset for uav imagery.ISPRS journal of photogrammetry and remote sensing, 165:108–119, 2020. 5
2020
-
[24]
Rethinking the implicit optimization paradigm with dual alignments for referring remote sensing image seg- mentation
Yuwen Pan, Rui Sun, Yuan Wang, Tianzhu Zhang, and Yong- dong Zhang. Rethinking the implicit optimization paradigm with dual alignments for referring remote sensing image seg- mentation. InProceedings of the 32nd ACM International Conference on Multimedia, page 2031–2040, New York, NY , USA, 2024. Association for Computing Machinery. 1
2031
-
[25]
Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation
Zelin Peng, Zhengqin Xu, Zhilin Zeng, Yu Huang, Yaom- ing Wang, and Wei Shen. Parameter-efficient fine-tuning in hyperspherical space for open-vocabulary semantic segmen- tation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15009–15020, 2025. 1, 2
2025
-
[26]
I. Potsdam. 2d semantic labeling dataset., 2018. 5
2018
-
[27]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1, 2, 7
2021
-
[28]
Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022
Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test- time prompt tuning for zero-shot generalization in vision- language models.Advances in Neural Information Process- ing Systems, 35:14274–14289, 2022. 3
2022
-
[29]
Test-time training with self- supervision for generalization under distribution shifts
Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self- supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–
-
[30]
Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020
Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Ol- shausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726,
-
[31]
arXiv preprint arXiv:2110.08733
Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021. 5
-
[32]
isaid: A large-scale dataset for instance segmentation in aerial images
Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 28–37, 2019. 5
2019
-
[33]
Openearthmap: A benchmark dataset for global high-resolution land cover mapping
Junshi Xia, Naoto Yokoya, Bruno Adriano, and Clifford Broni-Bediako. Openearthmap: A benchmark dataset for global high-resolution land cover mapping. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6254–6264, 2023. 5
2023
-
[34]
Sed: A simple encoder-decoder for open- vocabulary semantic segmentation
Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open- vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3426–3436, 2024. 2
2024
-
[35]
To- wards open-vocabulary remote sensing image semantic seg- mentation
Chengyang Ye, Yunzhi Zhuge, and Pingping Zhang. To- wards open-vocabulary remote sensing image semantic seg- mentation. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 9436–9444, 2025. 3
2025
-
[36]
Hee Suk Yoon, Eunseop Yoon, Joshua Tian Jin Tee, Mark Hasegawa-Johnson, Yingzhen Li, and Chang D Yoo. C-tpt: Calibrated test-time prompt tuning for vision- language models via text feature dispersion.arXiv preprint arXiv:2403.14119, 2024. 3
-
[37]
Learning transferable land cover seman- tics for open vocabulary interactions with remote sensing im- ages.ISPRS Journal of Photogrammetry and Remote Sens- ing, 220:621–636, 2025
Val ´erie Zermatten, Javiera Castillo-Navarro, Diego Marcos, and Devis Tuia. Learning transferable land cover seman- tics for open vocabulary interactions with remote sensing im- ages.ISPRS Journal of Photogrammetry and Remote Sens- ing, 220:621–636, 2025. 1
2025
-
[38]
Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation
Dengke Zhang, Fagui Liu, and Quan Tang. Corrclip: Re- constructing patch correlations in clip for open-vocabulary semantic segmentation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 24677– 24687, 2025. 3, 7
2025
-
[39]
Shuai Zhao, Xiaohan Wang, Linchao Zhu, and Yi Yang. Test-time adaptation with clip reward for zero-shot gen- eralization in vision-language models.arXiv preprint arXiv:2305.18010, 2023. 3
-
[40]
Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, and Xue Yang. Instructsam: A training-free framework for instruction-oriented remote sensing object recognition.arXiv preprint arXiv:2505.15818, 2025. 3
-
[41]
Extract free dense labels from clip
Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 7
2022
-
[42]
Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling
Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 14733–14744, 2025. 3
2025
-
[43]
Regionmatch: Pixel- region collaboration for semi-supervised semantic segmenta- tion in remote sensing images
Xiaoqian Zhu, Xiangrong Zhang, Tianyang Zhang, Chaowei Fang, Xu Tang, and Licheng Jiao. Regionmatch: Pixel- region collaboration for semi-supervised semantic segmenta- tion in remote sensing images. InProceedings of the Thirty- Fourth International Joint Conference on Artificial Intelli- gence, IJCAI-25, pages 2530–2538, 2025. 1
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.