Prompt-Calibrated SAM 3 for Open-Vocabulary Remote Sensing Semantic Segmentation

Chengfu Yang; Haonan Yin; Nanqing Liu; Qi Ming; Yanghui Song; Yingjie Gao

arxiv: 2606.21863 · v1 · pith:MDPXPAN2new · submitted 2026-06-20 · 💻 cs.CV

Prompt-Calibrated SAM 3 for Open-Vocabulary Remote Sensing Semantic Segmentation

Yanghui Song , Nanqing Liu , Haonan Yin , Yingjie Gao , Chengfu Yang , Qi Ming This is my paper

Pith reviewed 2026-06-26 12:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords open-vocabulary semantic segmentationremote sensingSAM 3prompt calibrationtraining-freemultimodal large language modelsemantic segmentationaerial imagery

0 comments

The pith

ProC-SAM3 calibrates SAM 3 prompts offline to reach 56.1 percent average mIoU in remote sensing open-vocabulary segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to fix three prompt problems in SAM 3 when segmenting remote sensing images into categories that were never seen in training. It builds an offline pool of refined prompts from multimodal language model outputs, caches the embeddings so they are not recomputed for every image, and adds a gating step that suppresses noisy decoder results while keeping activations for small objects. A sympathetic reader would care because remote sensing data constantly introduces new object types in satellite or aerial views, and a training-free method that works across eight different benchmarks would let analysts apply the same model to changing label needs without retraining. If the calibration works as described, the approach would cut both redundant computation and false activations that previously hurt accuracy on complex scenes.

Core claim

ProC-SAM3 calibrates SAM 3's prompt interface for remote sensing OVSS from three complementary aspects. First, an offline prompt pool is built where a Category Matcher groups MLLM-generated candidates into per-category sets and Expansion Constraints refine each set using category-specific prior knowledge. Second, the resulting text embeddings are cached and reused across all test images. Third, Presence-Guided Residual Fusion gates unreliable decoder outputs by prompt presence and confidence, followed by peak-preserving class aggregation that retains fine-grained activations for small and sparse objects. On eight benchmarks this yields an average mIoU of 56.1 percent, 3.9 percentage points a

What carries the argument

The three-aspect prompt calibration pipeline: offline prompt pool via Category Matcher and Expansion Constraints, cached text embeddings, and Presence-Guided Residual Fusion with peak-preserving aggregation.

If this is right

Caching embeddings removes repeated online text encoding for every test image.
Presence-Guided Residual Fusion limits noisy activations from reaching the final map.
Peak-preserving aggregation keeps fine detail for small and sparse objects that earlier aggregation steps tended to lose.
The full pipeline remains training-free and produces measurable gains on eight separate remote sensing benchmarks.
The method directly targets the three listed prompt-interface shortcomings of prior SAM 3 OVSS work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The offline prompt-pool construction could be reused with newer multimodal models if their candidate generation is swapped in.
Gating based on prompt presence might transfer to other decoder-heavy segmentation models that suffer from activation noise.
The reported gains on small-object retention suggest the approach could help monitoring tasks where targets occupy few pixels, such as infrastructure or vegetation mapping.
If the Category Matcher logic proves robust, similar grouping steps could reduce prompt redundancy in non-remote-sensing open-vocabulary tasks.

Load-bearing premise

The assumption that MLLM-generated prompt candidates grouped by the Category Matcher and refined by Expansion Constraints, combined with presence-guided gating, will reliably reduce noise and preserve small-object detail across unseen remote sensing distributions without introducing new biases or missing categories.

What would settle it

Evaluating ProC-SAM3 on a new remote sensing test set rich in small, sparse objects from unseen categories and measuring whether its mIoU falls below the previous training-free baseline would directly test the performance claim.

Figures

Figures reproduced from arXiv: 2606.21863 by Chengfu Yang, Haonan Yin, Nanqing Liu, Qi Ming, Yanghui Song, Yingjie Gao.

**Figure 1.** Figure 1: Comparison of SAM 3 adaptation paradigms for remote sensing OVSS across three phases: prompt construction, inference, and mask aggregation. SegEarth-OV3 lacks semantic calibration and encodes prompts repeatedly online; ConceptBank achieves calibration but relies on ground-truth annotations; our ProC-SAM3 provides GT-free calibration with cached embeddings and presence-guided aggregation. However, the key c… view at source ↗

**Figure 2.** Figure 2: Overview of the proposed ProC-SAM3, which operates in a fully training-free manner through three stages: (a) it first constructs a semantically enriched prompt pool offline via MLLM candidate generation, category matching, and expansion constraint refinement; (b) it then caches the prompt embeddings and performs efficient batched decoding; and (c) it finally fuses per-prompt responses into per-class predic… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on representative remote sensing scenes. and VDD. Input images are resized to 1008 × 1008 by the SAM3 processor. The offline prompt pool is constructed once per dataset using Qwen3-VL-8B under constraint files, and text embeddings are pre-computed and cached by the prompt bank. During inference, prompts are processed in batches with a size of 1 for UDD5, VDD, and UAVid, and 4 for oth… view at source ↗

**Figure 4.** Figure 4: Ablation on different fusion strategies. Potsdam Vaihingen OpenEarthMap LoveDA 0.0 0.5 1.0 1.5 2.0 2.5 3.0 FPS 1.35 1.14 0.69 0.66 2.42 1.95 1.02 1.01 w/o POBI w/ POBI [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

Open-vocabulary semantic segmentation (OVSS) in remote sensing images aims to segment categories beyond a fixed label space. Recent SAM 3-based methods provide a promising training-free foundation, yet three key issues remain: (1) a single class-name prompt lacks sufficient semantic coverage for complex remote sensing categories; (2) expanding each category into multiple prompts introduces redundant online text encoding; and (3) directly aggregating multiple prompt responses propagates noisy activations into the final prediction. To address these issues, we propose ProC-SAM3, which calibrates SAM 3's prompt interface for remote sensing OVSS from three complementary aspects. First, we construct an offline prompt pool where a Category Matcher groups MLLM-generated candidates into per-category sets, and Expansion Constraints further refine each set using category-specific prior knowledge. Second, the resulting text embeddings are cached and reused across all test images, eliminating repeated text encoding. Third, we introduce Presence-Guided Residual Fusion to gate unreliable decoder outputs by prompt presence and confidence, followed by peak-preserving class aggregation that retains fine-grained activations for small and sparse objects. Experiments on eight benchmarks show that ProC-SAM3 achieves an average mIoU of 56.1%, outperforming the previous best training-free method by 3.9 percentage points. Code will be available at https://github.com/YanghuiSong/ProC-SAM3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ProC-SAM3 adds a concrete offline prompt calibration and gated fusion step to SAM 3 for remote sensing OVSS and reports a 3.9-point mIoU gain on eight benchmarks, but the experimental controls remain thin.

read the letter

The paper's main contribution is a three-part calibration of SAM 3's prompt interface for open-vocabulary remote sensing segmentation. It builds an offline prompt pool by running an MLLM, grouping candidates with a Category Matcher, and tightening them with Expansion Constraints; caches the resulting embeddings; and replaces naive aggregation with Presence-Guided Residual Fusion plus peak-preserving class aggregation. This directly targets the three problems named in the abstract: narrow semantic coverage from single prompts, repeated text encoding cost, and noise from multiple prompt responses.

The approach is internally consistent and keeps the method training-free, which matters for the target domain. The reported average mIoU of 56.1% and the 3.9-point lift over the prior best training-free baseline are the concrete numbers to evaluate. Caching is a straightforward efficiency win, and the gating logic aims to protect small-object detail without extra training.

The soft spots are in the experimental section. The abstract gives no detail on baseline re-implementations, hyper-parameter search, or statistical significance of the gains. The central assumption—that MLLM-generated prompts plus presence gating will reduce noise without systematically missing categories or introducing new biases across unseen remote sensing distributions—receives no direct stress test in the provided summary. If those choices turn out to be post-hoc or dataset-specific, the margin could shrink.

This is useful reading for people already working on SAM-based OVSS or remote sensing segmentation pipelines. It does not claim a new paradigm, but the pipeline is described clearly enough that a reader could try to reproduce the calibration steps. The work shows honest engagement with the practical constraints of the domain.

I would send it to peer review. The empirical claim is falsifiable on public benchmarks and the method is specified at a level that allows checking, even if revisions will be needed on controls and ablations.

Referee Report

3 major / 2 minor

Summary. The paper proposes ProC-SAM3, a training-free calibration of SAM 3 for open-vocabulary semantic segmentation on remote sensing images. It addresses limited semantic coverage of single class-name prompts, redundant online text encoding from prompt expansion, and noisy aggregation of multiple prompt responses via three components: an offline prompt pool (MLLM-generated candidates grouped by Category Matcher and refined by Expansion Constraints), cached text embeddings, and Presence-Guided Residual Fusion (gating by prompt presence/confidence plus peak-preserving class aggregation to retain small-object detail). Experiments on eight benchmarks report 56.1% average mIoU, a 3.9 pp gain over the prior best training-free baseline.

Significance. If the reported gains prove robust, the work supplies a practical, efficient mechanism for improving training-free OVSS in remote sensing by reducing prompt redundancy and controlling noise while preserving fine-grained activations. The offline caching and presence-guided gating are internally consistent with the problem statement and could transfer to other SAM-based pipelines where repeated encoding cost or small-object fidelity is a concern.

major comments (3)

[§4] §4 (Experiments): The central claim of a 3.9 pp mIoU improvement rests on comparisons to prior training-free methods, yet the manuscript provides no implementation details for the baselines, no statistical significance tests, and no analysis of post-hoc choices in fusion thresholds or presence thresholds; without these, it is impossible to determine whether the reported average of 56.1% is reproducible or sensitive to implementation decisions.
[§3.3] §3.3 (Presence-Guided Residual Fusion): The gating mechanism is presented as reliably reducing noise while preserving small-object detail, but no ablation isolates the contribution of the presence-guided term versus simple averaging, nor are failure cases or category-miss rates reported on the eight benchmarks; this leaves the weakest assumption (reliable noise reduction across unseen distributions) unsupported by direct evidence.
[Abstract, §4] Abstract and §4: The eight-benchmark evaluation is summarized only by the aggregate 56.1% mIoU; per-dataset scores, variance across random seeds, and comparison against strong training-based OVSS baselines are absent, weakening the interpretation that the architectural changes are the primary driver of the observed delta.

minor comments (2)

[§3.1] Notation for the Category Matcher and Expansion Constraints is introduced without an explicit algorithm box or pseudocode, making the offline pool construction harder to re-implement.
[Abstract] The paper states that code will be released but does not specify the exact prompt-generation procedure or the MLLM used, which affects reproducibility of the prompt pool.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments, which highlight important aspects of reproducibility and empirical validation. We address each major comment below and will incorporate revisions to strengthen the paper.

read point-by-point responses

Referee: [§4] The central claim of a 3.9 pp mIoU improvement rests on comparisons to prior training-free methods, yet the manuscript provides no implementation details for the baselines, no statistical significance tests, and no analysis of post-hoc choices in fusion thresholds or presence thresholds; without these, it is impossible to determine whether the reported average of 56.1% is reproducible or sensitive to implementation decisions.

Authors: We agree that additional details are necessary for reproducibility. In the revised version, we will provide full implementation details for the baselines (including any hyperparameters and code references), conduct statistical significance tests on the mIoU improvements, and include an analysis of the sensitivity to the fusion and presence thresholds. Since the code will be released, this will allow full verification. revision: yes
Referee: [§3.3] The gating mechanism is presented as reliably reducing noise while preserving small-object detail, but no ablation isolates the contribution of the presence-guided term versus simple averaging, nor are failure cases or category-miss rates reported on the eight benchmarks; this leaves the weakest assumption (reliable noise reduction across unseen distributions) unsupported by direct evidence.

Authors: We acknowledge the need for more direct evidence. We will add an ablation study that compares the full Presence-Guided Residual Fusion against a simple averaging baseline. We will also report failure cases and category-miss rates across the benchmarks to better support the effectiveness of the noise reduction mechanism. revision: yes
Referee: [Abstract, §4] The eight-benchmark evaluation is summarized only by the aggregate 56.1% mIoU; per-dataset scores, variance across random seeds, and comparison against strong training-based OVSS baselines are absent, weakening the interpretation that the architectural changes are the primary driver of the observed delta.

Authors: We will revise the experiments section to report per-dataset mIoU scores for all eight benchmarks, include variance measures (e.g., across different prompt seeds if applicable), and add comparisons to representative training-based OVSS methods to provide broader context. This will help clarify the contributions of our training-free approach. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical training-free method (ProC-SAM3) consisting of an offline prompt pool construction via Category Matcher and Expansion Constraints, cached embeddings, and Presence-Guided Residual Fusion with peak-preserving aggregation. The central claim is an observed 56.1% average mIoU on eight external benchmarks, presented as the direct experimental outcome of these architectural changes rather than any derived theorem, fitted parameter renamed as prediction, or self-referential equation. No load-bearing step reduces by construction to its own inputs, no uniqueness theorem is imported via self-citation, and the performance delta is not forced by internal definitions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about foundation models and MLLMs rather than new postulates; no free parameters or invented entities are introduced in the abstract. Limited visibility into full paper prevents exhaustive audit.

axioms (2)

domain assumption SAM 3 produces usable segmentation masks from text prompts in remote sensing imagery
The entire calibration approach presupposes that the base SAM 3 decoder outputs are meaningful enough to be gated and aggregated.
domain assumption MLLM-generated text candidates can be grouped and constrained to form useful per-category prompt sets
This is invoked in the first aspect of the method for constructing the offline prompt pool.

pith-pipeline@v0.9.1-grok · 5792 in / 1532 out tokens · 32459 ms · 2026-06-26T12:26:50.892117+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 2 linked inside Pith

[1]

Clearclip: Decomposing clip representations for dense vision-language inference,

M. Lan, C. Chen, Y . Ke, X. Wang, L. Feng, and W. Zhang, “Clearclip: Decomposing clip representations for dense vision-language inference,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 143– 160

2024
[2]

Corrclip: Reconstructing patch correla- tions in clip for open-vocabulary semantic segmentation,

D. Zhang, F. Liu, and Q. Tang, “Corrclip: Reconstructing patch correla- tions in clip for open-vocabulary semantic segmentation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 24 677–24 687

2025
[3]

Open-vocabulary high-resolution remote sensing image semantic segmentation,

Q. Cao, Y . Chen, C. Ma, and X. Yang, “Open-vocabulary high-resolution remote sensing image semantic segmentation,”IEEE Transactions on Geoscience and Remote Sensing, 2025

2025
[4]

Sota: Self-adaptive optimal transport for zero-shot classification with multiple foundation models,

Z. Hu, Q. Xu, Y . Duan, Y . Tai, and H. Li, “Sota: Self-adaptive optimal transport for zero-shot classification with multiple foundation models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 26 624–26 634

2026
[5]

Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,

K. Li, R. Liu, X. Cao, X. Bai, F. Zhou, D. Meng, and Z. Wang, “Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 545–10 556

2025
[6]

Annotation-free open-vocabulary segmentation for remote-sensing im- ages,

K. Li, X. Cao, R. Liu, S. Wang, Z. Jiang, Z. Wang, and D. Meng, “Annotation-free open-vocabulary segmentation for remote-sensing im- ages,”arXiv preprint arXiv:2508.18067, 2025

arXiv 2025
[7]

Segearth- ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sensing images,

K. Li, S. Zhang, Y . Deng, Z. Wang, D. Meng, and X. Cao, “Segearth- ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sensing images,”arXiv preprint arXiv:2512.08730, 2025

Pith/arXiv arXiv 2025
[8]

Pointsam: Pointly- supervised segment anything model for remote sensing images,

N. Liu, X. Xu, Y . Su, H. Zhang, and H.-C. Li, “Pointsam: Pointly- supervised segment anything model for remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–15, 2025

2025
[9]

Taming sam3 in the wild: A concept bank for open-vocabulary segmentation,

G. Pei, X. Jiang, Y . Yao, X. Shu, F. Shen, and B. Jeon, “Taming sam3 in the wild: A concept bank for open-vocabulary segmentation,”arXiv preprint arXiv:2602.06333, 2026

arXiv 2026
[10]

Vip: Visual- guided prompt evolution for efficient dense vision-language inference,

H. Zhu, S. Jin, W. Liao, J. Xiao, Y . Zhu, S. Yu, and F. Dai, “Vip: Visual- guided prompt evolution for efficient dense vision-language inference,” arXiv preprint arXiv:2605.12325, 2026

Pith/arXiv arXiv 2026
[11]

Sclip: Rethinking self-attention for dense vision-language inference,

F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” inEuropean conference on computer vision. Springer, 2024, pp. 315–332

2024
[12]

Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,

M. Lan, C. Chen, Y . Ke, X. Wang, L. Feng, and W. Zhang, “Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 70– 88

2024
[13]

Rsclip for training-free open- vocabulary remote sensing image semantic segmentation,

S. Wang, X. Sun, J. Han, and X. X. Zhu, “Rsclip for training-free open- vocabulary remote sensing image semantic segmentation,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

2026
[14]

Coninfer: Context- aware inference for training-free open-vocabulary remote sensing seg- mentation,

W. Chen, Z. Hu, Y . Zhang, H. Ning, and Y . Tai, “Coninfer: Context- aware inference for training-free open-vocabulary remote sensing seg- mentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 7408–7418

2026
[15]

Reattnclip: Training- free open-vocabulary remote sensing image segmentation via re-defined attention in clip,

X. Niu, M. Zhao, D. Jiang, Y . Wu, and B. Su, “Reattnclip: Training- free open-vocabulary remote sensing image segmentation via re-defined attention in clip,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 24 980–24 989

2026
[16]

Activesam: Image-conditional class pruning for fast and accurate open-vocabulary segmentation,

T. D. Tien and Z. Shen, “Activesam: Image-conditional class pruning for fast and accurate open-vocabulary segmentation,”arXiv preprint arXiv:2606.16996, 2026

arXiv 2026

[1] [1]

Clearclip: Decomposing clip representations for dense vision-language inference,

M. Lan, C. Chen, Y . Ke, X. Wang, L. Feng, and W. Zhang, “Clearclip: Decomposing clip representations for dense vision-language inference,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 143– 160

2024

[2] [2]

Corrclip: Reconstructing patch correla- tions in clip for open-vocabulary semantic segmentation,

D. Zhang, F. Liu, and Q. Tang, “Corrclip: Reconstructing patch correla- tions in clip for open-vocabulary semantic segmentation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 24 677–24 687

2025

[3] [3]

Open-vocabulary high-resolution remote sensing image semantic segmentation,

Q. Cao, Y . Chen, C. Ma, and X. Yang, “Open-vocabulary high-resolution remote sensing image semantic segmentation,”IEEE Transactions on Geoscience and Remote Sensing, 2025

2025

[4] [4]

Sota: Self-adaptive optimal transport for zero-shot classification with multiple foundation models,

Z. Hu, Q. Xu, Y . Duan, Y . Tai, and H. Li, “Sota: Self-adaptive optimal transport for zero-shot classification with multiple foundation models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 26 624–26 634

2026

[5] [5]

Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,

K. Li, R. Liu, X. Cao, X. Bai, F. Zhou, D. Meng, and Z. Wang, “Segearth-ov: Towards training-free open-vocabulary segmentation for remote sensing images,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 545–10 556

2025

[6] [6]

Annotation-free open-vocabulary segmentation for remote-sensing im- ages,

K. Li, X. Cao, R. Liu, S. Wang, Z. Jiang, Z. Wang, and D. Meng, “Annotation-free open-vocabulary segmentation for remote-sensing im- ages,”arXiv preprint arXiv:2508.18067, 2025

arXiv 2025

[7] [7]

Segearth- ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sensing images,

K. Li, S. Zhang, Y . Deng, Z. Wang, D. Meng, and X. Cao, “Segearth- ov3: Exploring sam 3 for open-vocabulary semantic segmentation in remote sensing images,”arXiv preprint arXiv:2512.08730, 2025

Pith/arXiv arXiv 2025

[8] [8]

Pointsam: Pointly- supervised segment anything model for remote sensing images,

N. Liu, X. Xu, Y . Su, H. Zhang, and H.-C. Li, “Pointsam: Pointly- supervised segment anything model for remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–15, 2025

2025

[9] [9]

Taming sam3 in the wild: A concept bank for open-vocabulary segmentation,

G. Pei, X. Jiang, Y . Yao, X. Shu, F. Shen, and B. Jeon, “Taming sam3 in the wild: A concept bank for open-vocabulary segmentation,”arXiv preprint arXiv:2602.06333, 2026

arXiv 2026

[10] [10]

Vip: Visual- guided prompt evolution for efficient dense vision-language inference,

H. Zhu, S. Jin, W. Liao, J. Xiao, Y . Zhu, S. Yu, and F. Dai, “Vip: Visual- guided prompt evolution for efficient dense vision-language inference,” arXiv preprint arXiv:2605.12325, 2026

Pith/arXiv arXiv 2026

[11] [11]

Sclip: Rethinking self-attention for dense vision-language inference,

F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for dense vision-language inference,” inEuropean conference on computer vision. Springer, 2024, pp. 315–332

2024

[12] [12]

Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,

M. Lan, C. Chen, Y . Ke, X. Wang, L. Feng, and W. Zhang, “Proxyclip: Proxy attention improves clip for open-vocabulary segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 70– 88

2024

[13] [13]

Rsclip for training-free open- vocabulary remote sensing image semantic segmentation,

S. Wang, X. Sun, J. Han, and X. X. Zhu, “Rsclip for training-free open- vocabulary remote sensing image semantic segmentation,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026

2026

[14] [14]

Coninfer: Context- aware inference for training-free open-vocabulary remote sensing seg- mentation,

W. Chen, Z. Hu, Y . Zhang, H. Ning, and Y . Tai, “Coninfer: Context- aware inference for training-free open-vocabulary remote sensing seg- mentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 7408–7418

2026

[15] [15]

Reattnclip: Training- free open-vocabulary remote sensing image segmentation via re-defined attention in clip,

X. Niu, M. Zhao, D. Jiang, Y . Wu, and B. Su, “Reattnclip: Training- free open-vocabulary remote sensing image segmentation via re-defined attention in clip,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 24 980–24 989

2026

[16] [16]

Activesam: Image-conditional class pruning for fast and accurate open-vocabulary segmentation,

T. D. Tien and Z. Shen, “Activesam: Image-conditional class pruning for fast and accurate open-vocabulary segmentation,”arXiv preprint arXiv:2606.16996, 2026

arXiv 2026