pith. machine review for the scientific record.

arxiv: 2604.21032 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-spectral imagery · large multi-modal models · remote sensing · chain-of-thought reasoning · training-free method · zero-shot performance · input adaptation

The pith

Standard RGB-trained multi-modal models can process multi-spectral imagery by adapting inputs and adding domain-guided chain-of-thought prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

General large multi-modal models are usually limited to ordinary color photographs, so they miss the extra wavelength bands that multi-spectral satellite images provide for tasks like land-use classification. The paper introduces a method that converts those extra bands into a visual format the model already recognizes and then supplies the model with domain knowledge plus explicit step-by-step reasoning instructions inside the prompt. All of this occurs at inference time with no model updates or task-specific training, turning existing generalist models into tools for specialized remote sensing data. Experiments with Gemini 2.5 on standard benchmarks show clear zero-shot accuracy gains, suggesting geospatial work can now draw on powerful reasoning without building new specialized models.

Core claim

We propose a training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs by adapting non-RGB inputs to the model's visual space and injecting domain-specific information and Chain-of-Thought reasoning as instructions, yielding strong zero-shot performance gains on remote sensing benchmarks when tested with Gemini 2.5.

What carries the argument

Input adaptation that maps multi-spectral bands into an RGB model's visual space, combined with prompt injection of domain-specific facts and Chain-of-Thought reasoning steps.
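A minimal sketch of that machinery, assuming Sentinel-2-style band names, a simple percentile stretch, and illustrative prompt wording; none of these specifics are taken from the paper itself:

```python
import numpy as np

def to_uint8(x):
    """Stretch a float array to 0-255 so it can be viewed as an ordinary image."""
    lo, hi = np.nanpercentile(x, 2), np.nanpercentile(x, 98)
    x = np.clip((x - lo) / (hi - lo + 1e-6), 0.0, 1.0)
    return (x * 255).astype(np.uint8)

def adapt_bands(bands):
    """Turn raw multi-spectral bands into views an RGB-trained model can ingest."""
    views = {}
    # True-color composite from the visible bands (B04=red, B03=green, B02=blue).
    views["rgb"] = to_uint8(np.stack([bands["B04"], bands["B03"], bands["B02"]], axis=-1))
    # False-color composite that surfaces vegetation via near-infrared (B08).
    views["false_color"] = to_uint8(np.stack([bands["B08"], bands["B04"], bands["B03"]], axis=-1))
    # Spectral indices rendered as single-channel images.
    ndvi = (bands["B08"] - bands["B04"]) / (bands["B08"] + bands["B04"] + 1e-6)
    ndwi = (bands["B03"] - bands["B08"]) / (bands["B03"] + bands["B08"] + 1e-6)
    views["ndvi"], views["ndwi"] = to_uint8(ndvi), to_uint8(ndwi)
    return views

# Domain facts plus chain-of-thought instructions injected at inference time.
PROMPT = (
    "You are shown several views of one satellite scene: a true-color image, "
    "a false-color near-infrared composite, an NDVI map, and an NDWI map. "
    "High NDVI indicates vegetation; high NDWI indicates open water. "
    "Reason step by step: describe what each view shows, reconcile any "
    "disagreements, then answer with exactly one land-cover class."
)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bands = {b: rng.random((64, 64)).astype(np.float32) for b in ("B02", "B03", "B04", "B08")}
    views = adapt_bands(bands)
    print({name: view.shape for name, view in views.items()})
```

The derived views are handed to the frozen model as ordinary images alongside the prompt; the paper's Figure 2 shows six such derived modalities, and Figure 3 shows the NDWI view resolving a ‘River’ versus ‘Highway’ confusion.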

If this is right

  • Existing generalist LMMs become immediately usable for multi-spectral remote sensing without retraining or new model creation.
  • Zero-shot accuracy rises on standard land-cover and environmental monitoring benchmarks.
  • Geospatial professionals can apply rich reasoning from large models directly to specialized sensor inputs.
  • The high cost of training dedicated multi-spectral multi-modal models can be avoided for many applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same adaptation pattern could be tested on other non-visible sensor types such as hyperspectral or thermal imagery to check whether the visual-space mapping generalizes.
  • Different ways of phrasing the domain prompts might be explored to see if further gains are possible on specific remote sensing subtasks.
  • The approach suggests that prompt engineering can serve as a lightweight bridge across sensor-domain gaps in frozen models.

Load-bearing premise

That multi-spectral inputs can be mapped into the visual space already understood by an RGB-only model and that added domain instructions will reliably produce correct reasoning without any changes to the model itself.

What would settle it

A direct comparison on the same remote sensing benchmarks where the model receives the identical multi-spectral images but without the input adaptation or the guided chain-of-thought prompts, showing no performance improvement or a drop relative to the proposed method.
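Sketched as a harness, that decisive test is a small condition grid; the evaluate() function here is a hypothetical placeholder, not an interface from the paper:

```python
from itertools import product

# The four runs needed to separate what the adapted inputs contribute from
# what the guided prompt contributes on a fixed benchmark.
INPUT_CONDITIONS = ("rgb_only", "adapted_multispectral")
PROMPT_CONDITIONS = ("plain_question", "domain_facts_plus_cot")

def evaluate(inputs: str, prompt: str) -> float:
    """Placeholder: run the frozen LMM under one condition and return accuracy."""
    raise NotImplementedError

def ablation_grid():
    """Enumerate every (input, prompt) pairing for the comparison."""
    return list(product(INPUT_CONDITIONS, PROMPT_CONDITIONS))

for inputs, prompt in ablation_grid():
    print(f"run: inputs={inputs:<22} prompt={prompt}")
```

Holding the prompt condition fixed while switching inputs (and vice versa) is what separates the contribution of the adapted bands from the known benefit of chain-of-thought prompting, which is also the referee's first major comment below.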

Figures

Figures reproduced from arXiv: 2604.21032 by Anelia Angelova, Dahun Kim, Ganesh Satish Mallya.

Figure 1. A generalist Large Multi-Modal Model (LMM), intended for RGB …
Figure 2. Examples of the six input modalities derived from the multi …
Figure 3. Example results on EuroSat. Top: Our multi-spectral model with Chain-of-Thought (CoT) reasoning correctly predicts ‘River’, whereas the RGB-only baseline outputs ‘Highway’. The multi-spectral inputs, particularly the NDWI (4th image), clearly distinguish water bodies where RGB features are ambiguous. Bottom: A ‘Forest’ example correctly identified by our method. The RGB-only baseline misinterprets the gre…
read the original abstract

Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a novel training-free approach that enables standard RGB-only Large Multi-modal Models (LMMs) to process multi-spectral imagery for remote sensing tasks. The method adapts non-RGB inputs to the model's visual space and augments inference with domain-specific information plus Chain-of-Thought reasoning instructions, purportedly yielding strong zero-shot performance gains on popular remote sensing benchmarks when demonstrated with Gemini 2.5.

Significance. If the claimed gains are validated with appropriate controls, the work could have moderate significance for computer vision and remote sensing applications. It suggests a practical way to extend generalist LMMs to specialized sensor data without retraining, potentially allowing geospatial users to combine multi-spectral signals with the reasoning strengths of large models.

major comments (2)
  1. [Method / Results] Method / Results sections: The central claim attributes performance gains to the multi-spectral input adaptation. However, no ablation is described that holds the domain-specific instructions and Chain-of-Thought prompts fixed while comparing adapted multi-spectral inputs against standard RGB inputs. Without this control, improvements cannot be distinguished from the known benefits of CoT prompting alone on RGB data.
  2. [Abstract] Abstract: The abstract asserts 'large gains in performance' and 'strong Zero-Shot performance gains' on 'popular Remote Sensing benchmarks' but supplies no quantitative numbers, specific benchmark names, baseline comparisons, or error analysis. This omission prevents evaluation of the magnitude, statistical significance, or reliability of the reported results.
minor comments (2)
  1. [Method] The adaptation procedure ('adapting non-RGB inputs to that space') is described at a high level; concrete implementation details (e.g., channel remapping, normalization, or pseudo-RGB conversion) would improve reproducibility.
  2. [Discussion] The manuscript would benefit from a limitations or failure-case analysis to clarify when the guided-input approach succeeds or breaks down.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Method / Results] Method / Results sections: The central claim attributes performance gains to the multi-spectral input adaptation. However, no ablation is described that holds the domain-specific instructions and Chain-of-Thought prompts fixed while comparing adapted multi-spectral inputs against standard RGB inputs. Without this control, improvements cannot be distinguished from the known benefits of CoT prompting alone on RGB data.

    Authors: We concur with the referee that an ablation study isolating the effect of the multi-spectral input adaptation, while keeping the domain-specific instructions and Chain-of-Thought prompts constant, would provide stronger evidence for our claims. We will add this analysis to the Method and Results sections in the revised manuscript. Specifically, we will report performance on standard RGB inputs using the same prompting strategy and compare it to the adapted multi-spectral case. revision: yes

  2. Referee: [Abstract] Abstract: The abstract asserts 'large gains in performance' and 'strong Zero-Shot performance gains' on 'popular Remote Sensing benchmarks' but supplies no quantitative numbers, specific benchmark names, baseline comparisons, or error analysis. This omission prevents evaluation of the magnitude, statistical significance, or reliability of the reported results.

    Authors: We agree that the abstract would benefit from including quantitative results to allow readers to assess the magnitude of the improvements. In the revised version, we will incorporate specific performance numbers, the names of the benchmarks, baseline comparisons, and notes on the statistical reliability of the findings. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical prompting recipe with independent empirical validation

full rationale

The paper describes a training-free inference-time method that adapts multi-spectral inputs to an RGB-only LMM's visual space and augments prompts with domain knowledge plus Chain-of-Thought instructions. No equations, fitted parameters, or mathematical derivations appear in the provided text. The central claim rests on zero-shot benchmark results rather than any self-referential definition or self-citation chain. The method is not equivalent to its inputs by construction; performance gains are presented as an empirical observation open to external verification or ablation. This is the normal case of a non-derivational empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on untested assumptions about how well an RGB-trained model can interpret adapted multi-spectral signals when prompted; no free parameters or invented entities are identified, and the two axioms below are domain assumptions rather than formal postulates.

axioms (2)
  • domain assumption: An RGB-only LMM can interpret suitably adapted non-RGB imagery when given domain instructions and chain-of-thought prompts
    This is the central premise that allows the training-free claim.
  • domain assumption: Chain-of-thought prompting reliably improves performance on remote-sensing tasks for these models
    Invoked to justify the reasoning component of the method.

pith-pipeline@v0.9.0 · 5484 in / 1328 out tokens · 42300 ms · 2026-05-10T00:02:17.226406+00:00 · methodology

