MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

Hui Xue; Kaixiang Chen; Pengfei Fang

arxiv: 2605.25479 · v1 · pith:4ZBHC77Enew · submitted 2026-05-25 · 💻 cs.CV

Kaixiang Chen, Pengfei Fang, Hui Xue

Pith reviewed 2026-06-29 22:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: parameter-efficient fine-tuning · vision-language models · cross-modal coupling · few-shot learning · agent layers · CLIP adaptation · bidirectional bridges

The pith

MAIL inserts lightweight agent layers after LayerNorm in vision-language models to directly couple vision and language streams during adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that embedding cross-modal coupling inside the VLM's own computation modules, rather than through external add-ons, allows lightweight agent layers to approximate full fine-tuning updates while preserving efficiency. This matters in low-data regimes where full fine-tuning overfits and standard PEFT methods suffer from coarse, decoupled interactions between modalities. MAIL achieves the coupling via a bottleneck text-to-image bridge that jointly optimizes paired layers; MAIL++ adds bidirectional exchange through meta layers and bridges. At inference the layers fold back into the frozen backbone. Experiments on few-shot classification and cross-domain retrieval show consistent gains over prior PEFT approaches.

Core claim

MAIL freezes the VLM backbone and inserts lightweight agent layers after core modules such as LayerNorm; a bottleneck-based text-to-image bridge jointly optimizes the paired layers across modalities to coordinate adaptation. MAIL++ extends the design with a meta agent layer, meta-text bridge, and meta-image bridge for bidirectional exchange. All agent layers re-parameterize into the backbone at inference, keeping original efficiency. The approach yields higher accuracy than existing PEFT methods on few-shot image classification and few-shot universal cross-domain retrieval.
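The re-parameterization step can be illustrated with a small sketch. It assumes, purely for illustration, that an agent layer is a per-channel scale and shift applied to the output of a frozen linear module; the abstract does not spell out the agent layer's form, and the names `linear`, `agent`, and `fold` are hypothetical. Under that assumption the agent layer folds exactly into the module's weight and bias, so inference runs the original architecture.

```python
# Hypothetical sketch: fold an element-wise agent layer into a frozen linear module.
# The scale-and-shift form is an assumption, not the paper's confirmed definition.

def linear(W, b, x):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def agent(scale, shift, h):
    """Assumed agent layer: per-channel scale and shift of the module output."""
    return [s * h_i + t for s, t, h_i in zip(scale, shift, h)]

def fold(W, b, scale, shift):
    """Merge the agent layer into (W, b) so inference needs no extra layer."""
    W_folded = [[s * w_ij for w_ij in row] for s, row in zip(scale, W)]
    b_folded = [s * b_i + t for s, t, b_i in zip(scale, shift, b)]
    return W_folded, b_folded

W = [[1.0, 2.0], [0.5, -1.0]]          # frozen backbone weight
b = [0.1, -0.2]                        # frozen backbone bias
scale, shift = [1.5, 0.8], [0.05, 0.0]  # trained agent parameters (made-up values)
x = [2.0, -3.0]

trained_path = agent(scale, shift, linear(W, b, x))  # training-time forward pass
W_f, b_f = fold(W, b, scale, shift)
folded_path = linear(W_f, b_f, x)                    # inference-time forward pass
```

Both paths give the same output up to floating-point rounding, which is what lets the inference cost match the unmodified backbone in this simplified setting.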

What carries the argument

The Multi-Modal Interactive Agent Layer (MAIL): agent layers tied together by a bottleneck text-to-image bridge, extended in MAIL++ with meta bridges, so that paired vision and language computation modules inside the frozen VLM are adapted jointly.
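One way to picture a bottleneck text-to-image bridge is as a low-rank map that derives the image-side agent parameters from the text-side ones. The sketch below is an illustration under assumptions: the per-channel scale form, the zero initialization of the up-projection, and the names `make_bridge`, `image_scale_from_text`, and `bridge_param_count` are all hypothetical and not taken from the paper.

```python
import random

def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def make_bridge(d_text, d_image, rank, seed=0):
    """Assumed low-rank bridge: down-projection (rank x d_text), up-projection (d_image x rank)."""
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, 0.02) for _ in range(d_text)] for _ in range(rank)]
    up = [[0.0] * rank for _ in range(d_image)]  # zero init (assumption): bridge starts as a no-op
    return down, up

def image_scale_from_text(text_scale, down, up):
    """Derive the image-side agent scale from the text-side one; 1.0 is the identity scale."""
    z = matvec(down, text_scale)            # squeeze through the bottleneck
    return [1.0 + delta for delta in matvec(up, z)]

def bridge_param_count(d_text, d_image, rank):
    """Trainable parameters in the two projections."""
    return rank * (d_text + d_image)

d_text, d_image, rank = 8, 12, 2
down, up = make_bridge(d_text, d_image, rank)
text_scale = [1.0] * d_text

initial = image_scale_from_text(text_scale, down, up)  # identity scale at initialization
up[0][0] = 0.5                                         # pretend one training update happened
tuned = image_scale_from_text(text_scale, down, up)    # only image channel 0 moves
```

With these toy sizes the bridge holds `bridge_param_count(8, 12, 2)` = 40 parameters, against 96 for a dense 12 x 8 map, which is the usual motivation for a bottleneck.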

If this is right

Direct insertion of coupling inside existing modules improves representational expressiveness over external auxiliary modules.
Re-parameterization at inference restores the original VLM speed and memory footprint.
Bidirectional meta bridges in MAIL++ further strengthen cross-modal coordination beyond one-way bridges.
The method remains applicable to any VLM whose forward pass contains modules such as LayerNorm.
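The bidirectional exchange attributed to MAIL++ can be sketched as two bridges reading from one shared meta vector. This is one plausible reading, stated as an assumption: the shared-vector design, the bridge values, and the names `bridged_scale`, `meta_text_bridge`, and `meta_image_bridge` are hypothetical illustrations rather than the paper's definitions.

```python
def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def bridged_scale(meta, bridge):
    """Map the shared meta vector to a per-channel agent scale (1.0 is the identity)."""
    return [1.0 + delta for delta in matvec(bridge, meta)]

# Shared meta-scaling vector (hypothetical size 2) and two made-up bridges.
meta = [0.5, -0.25]
meta_text_bridge = [[0.2, 0.0], [0.0, 0.4], [0.1, 0.1]]               # 3 text channels
meta_image_bridge = [[0.0, 0.8], [0.4, 0.0], [0.2, 0.2], [0.0, 0.0]]  # 4 image channels

text_scale = bridged_scale(meta, meta_text_bridge)
image_scale = bridged_scale(meta, meta_image_bridge)

# Perturbing the shared vector moves both modalities at once.
meta_shifted = [meta[0] + 0.1, meta[1]]
text_shifted = bridged_scale(meta_shifted, meta_text_bridge)
image_shifted = bridged_scale(meta_shifted, meta_image_bridge)
```

Because both scales are functions of the same `meta` vector, a training signal arriving through either encoder updates `meta` and therefore reaches the other modality, in contrast to a one-way text-to-image bridge.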

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent-layer pattern could be tested on non-CLIP VLMs or on tasks requiring generation rather than retrieval.
If the approximation holds, the approach may reduce the need for task-specific prompt engineering in multimodal settings.
Scaling the meta bridges to deeper layers or larger models would be a direct next measurement.

Load-bearing premise

Lightweight agent layers placed after modules like LayerNorm can structurally stand in for the parameter changes of full fine-tuning and still support effective cross-modal coordination via the bottleneck and meta bridges.

What would settle it

A controlled run on the same few-shot classification and retrieval benchmarks where MAIL or MAIL++ accuracy falls below the best prior PEFT baseline while keeping the same frozen backbone and training budget.

Figures

Figures reproduced from arXiv: 2605.25479 by Hui Xue, Kaixiang Chen, Pengfei Fang.

**Figure 1.** Modality-coupled methods in fine-tuning VLMs: (a) MaPLe and MMRL achieve cross-modal coupling by establishing interconnections between the ...

**Figure 2.** Illustration of the coupling mechanisms in the agent layer. (a) The ...

**Figure 3.** The proposed Multi-Modal Interactive Agent Layer++ (MAIL++) for the transformer-based CLIP models. During training, we only fine-tune the agent ...

**Figure 4.** Ablation studies on insertion strategies across Transformer blocks.

**Figure 5.** Ablation studies on insertion strategies within Transformer blocks, the dimension of the meta-scaling vector, and the rank of the bridge function.

**Figure 6.** Ablation studies on the initialization methods of the bridge function.

**Figure 7.** Visualization of the norms of agent layers inserted after the MHSA ...

Original abstract

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MAIL++ puts agent layers inside VLMs after LayerNorm for tighter cross-modal PEFT, but the approximation to full fine-tuning updates lacks a shown mechanism.

The main thing here is a PEFT method that inserts lightweight agent layers directly after modules like LayerNorm inside the VLM backbone, using bottleneck bridges to couple vision and text streams at that level, plus meta layers for bidirectional exchange in MAIL++. At inference the layers get re-parameterized back in. This is positioned as more intrinsic than external auxiliary modules.

The new element is the specific architecture of paired agent layers optimized jointly via the text-to-image bottleneck, extended to meta bridges for bidirectional flow. The paper does a reasonable job framing why external coupling stays coarse and decoupled from the original computation, and it reports consistent gains over prior PEFT baselines on few-shot image classification and universal cross-domain retrieval.

The soft spot is the load-bearing claim that these post-LayerNorm insertions structurally approximate the parameter updates from full fine-tuning. LayerNorm normalizes activations and does not itself contain the linear weights whose changes full fine-tuning would produce, and the description gives no re-parameterization math, gradient-matching argument, or loss-landscape equivalence to support the approximation. The abstract supplies no equations, training details, or ablation breakdowns, so the empirical superiority is hard to assess from the given text.

This is for people working on efficient adaptation of large VLMs in low-data regimes who care about internal cross-modal mechanisms. A reader already following PEFT for multimodal models would get the architectural idea and the reported numbers, but would still need the full experiments and any supporting analysis to decide on the approximation claim.

It deserves a serious referee to check whether the experiments are solid and whether the mechanism is better justified in the body.

Referee Report

2 major / 2 minor

Summary. The paper proposes MAIL, a parameter-efficient fine-tuning (PEFT) paradigm for vision-language models that freezes the backbone and inserts lightweight agent layers after core modules such as LayerNorm to approximate full fine-tuning parameter updates. Cross-modal coupling is achieved via a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers; MAIL++ extends this with bidirectional exchange using a meta agent layer and meta bridges. All agent layers are re-parameterized into the frozen backbone at inference. The central empirical claim is that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods on few-shot image classification and few-shot universal cross-domain retrieval.

Significance. If the approximation mechanism is rigorously justified and the reported gains hold under standard controls, the approach would offer a structurally integrated alternative to external auxiliary modules for cross-modal adaptation, potentially improving representational expressiveness in low-data regimes while preserving inference efficiency.

major comments (2)

[Method (abstract and presumed §3)] The central claim that post-LayerNorm (and similar) agent layers structurally approximate the parameter updates of full fine-tuning lacks any re-parameterization identity, gradient-matching argument, or loss-landscape equivalence. LayerNorm normalizes activations and does not contain the linear weights whose deltas full fine-tuning would produce; the bottleneck and meta bridges are described only as jointly optimizing paired layers for coordination, without a derivation showing functional equivalence.
[Abstract / Experiments] The abstract asserts that MAIL and MAIL++ "consistently outperform state-of-the-art PEFT methods," yet supplies no quantitative results, tables, datasets, shot counts, or statistical controls. Without these, the superiority claim cannot be evaluated and remains unsupported by the provided text.

minor comments (2)

[Method] Notation for the bottleneck bridge, meta bridges, and re-parameterization step should be introduced with explicit equations rather than prose descriptions alone.
[Method] Clarify whether the agent layers are inserted after every LayerNorm or only selected ones, and whether this choice is ablated.
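To make concrete what an explicit re-parameterization identity could look like, the following worked equations assume an agent layer that applies a per-channel scale $s$ and shift $t$ to the output of an affine module with weight $W$ and bias $b$, or to a LayerNorm with gain $\gamma$, bias $\beta$, and normalized input $\hat{x}$. This form is an assumption for illustration; the abstract does not define the agent layer.

```latex
% Assumed agent layer: per-channel scale s and shift t after a frozen module.
\begin{aligned}
s \odot (W x + b) + t &= \bigl(\operatorname{diag}(s)\, W\bigr)\, x + (s \odot b + t), \\
W' &= \operatorname{diag}(s)\, W, \qquad b' = s \odot b + t, \\
\Delta W &= W' - W = \bigl(\operatorname{diag}(s) - I\bigr)\, W, \\
s \odot (\gamma \odot \hat{x} + \beta) + t &= (s \odot \gamma) \odot \hat{x} + (s \odot \beta + t).
\end{aligned}
```

Under this assumption the fold at inference is exact, but the implied update $\Delta W$ is restricted to a row-wise rescaling of $W$, so it can reproduce only a constrained subset of the updates that full fine-tuning could make.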

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and are prepared to revise the manuscript to improve clarity and support for the claims.

Referee: [Method (abstract and presumed §3)] The central claim that post-LayerNorm (and similar) agent layers structurally approximate the parameter updates of full fine-tuning lacks any re-parameterization identity, gradient-matching argument, or loss-landscape equivalence. LayerNorm normalizes activations and does not contain the linear weights whose deltas full fine-tuning would produce; the bottleneck and meta bridges are described only as jointly optimizing paired layers for coordination, without a derivation showing functional equivalence.

Authors: We thank the referee for this observation. The manuscript positions the post-LayerNorm agent layers as a structural mechanism to approximate the effects of full fine-tuning updates on the core computation modules, motivated by the typical placement of linear transformations after normalization in VLM architectures. This is presented as a design heuristic supported by empirical performance rather than a formal mathematical identity. We will revise Section 3 to explicitly clarify this distinction, expand the motivation for the placement and bridging mechanism, and emphasize that the approach is validated through downstream task results rather than theoretical equivalence proofs.

revision: yes

Referee: [Abstract / Experiments] The abstract asserts that MAIL and MAIL++ "consistently outperform state-of-the-art PEFT methods," yet supplies no quantitative results, tables, datasets, shot counts, or statistical controls. Without these, the superiority claim cannot be evaluated and remains unsupported by the provided text.

Authors: We agree that the abstract would benefit from concrete quantitative support for the performance claim. In the revised version we will incorporate key results, including average accuracy gains on few-shot classification (e.g., across ImageNet, CIFAR-100 and other benchmarks at 1/2/4/8/16 shots) and retrieval metrics, along with the primary datasets and comparison baselines referenced in the experiments section.

revision: yes
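The shot counts mentioned in the rebuttal imply a per-class subsampling protocol. A minimal sketch of drawing such k-shot training subsets is given below; the helper `sample_k_shot`, the toy labels, and the fixed seed are hypothetical, and the paper's actual sampling procedure is not described in the text available here.

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed=0):
    """Return sorted indices of a k-shot subset: k examples per class (fewer if a class is small)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so every method sees the same subset
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)

# Toy label list: two classes with 20 examples each and one rare class with 3.
labels = ["cat"] * 20 + ["dog"] * 20 + ["bird"] * 3
subsets = {k: sample_k_shot(labels, k, seed=0) for k in (1, 2, 4, 8, 16)}
```

Fixing the seed per shot count keeps the comparison between PEFT methods on identical training examples, which is the kind of control the referee's second major comment asks to see reported.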

Circularity Check

0 steps flagged

No derivation chain; empirical PEFT method validated by experiments

The paper describes an empirical PEFT architecture (MAIL/MAIL++) that inserts agent layers after modules like LayerNorm and claims they approximate full fine-tuning updates via bottleneck bridges, with re-parameterization at inference. No equations, derivations, or mathematical claims appear in the provided text. The central assertions rest on experimental outperformance rather than any reduction of outputs to inputs by construction, self-citation of uniqueness theorems, or fitted parameters renamed as predictions. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no equations, derivations, or modeling assumptions that can be audited for free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5793 in / 1044 out tokens · 30830 ms · 2026-06-29T22:37:42.305575+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 5 canonical work pages · 3 internal anchors

[1]

On the Opportunities and Risks of Foundation Models

R. Bommasani, “On the opportunities and risks of foundation models,” arXiv:2108.07258, 2021

2021
[2]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763

2021
[3]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778

2016
[4]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Representations, 2021

2021
[5]

A unified metric architecture for ai infrastructure: A cross-layer taxonomy integrating performance, efficiency, and cost,

Q. He, “A unified metric architecture for ai infrastructure: A cross-layer taxonomy integrating performance, efficiency, and cost,” arXiv:2511.21772, 2025

2025
[6]

Cp-clip: Core-periphery feature alignment clip for zero-shot medical image analysis,

X. Yu, Z. Wu, L. Zhang, J. Zhang, Y. Lyu, and D. Zhu, “Cp-clip: Core-periphery feature alignment clip for zero-shot medical image analysis,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Interv., 2024, pp. 88–97

2024
[7]

Vpl: Visual proxy learning framework for zero-shot medical image diagnosis,

J. Liu, T. Hu, H. Xiong, J. Du, Y. Feng, J. Wu, J. Zhou, and Z. Liu, “Vpl: Visual proxy learning framework for zero-shot medical image diagnosis,” in Findings of ACL: EMNLP 2024, 2024, pp. 9978–9992

2024
[8]

Medclip: Contrastive learning from unpaired medical images and text,

Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “Medclip: Contrastive learning from unpaired medical images and text,” in Proc. Conf. Empir. Methods Nat. Lang. Process., vol. 2022, 2022, pp. 3876–3887

2022
[9]

Pros: Prompting-to-simulate generalized knowledge for universal cross-domain retrieval,

K. Fang, J. Song, L. Gao, P. Zeng, Z.-Q. Cheng, X. Li, and H. T. Shen, “Pros: Prompting-to-simulate generalized knowledge for universal cross-domain retrieval,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 17292–17301

2024
[10]

Depro: Domain ensemble using decoupled prompts for universal cross-domain retrieval,

K. Chen, P. Fang, and H. Xue, “Depro: Domain ensemble using decoupled prompts for universal cross-domain retrieval,” in Proc. Int. ACM SIGIR Conf. Res. Dev. Inf. Retr., 2025, pp. 958–967

2025
[11]

Distill clip (dclip): Enhancing image-text retrieval via cross-modal transformer distillation,

D. Csizmadia, A. Codreanu, V. Sim, V. Prabhu, M. Lu, K. Zhu, S. O’Brien, and V. Sharma, “Distill clip (dclip): Enhancing image-text retrieval via cross-modal transformer distillation,” arXiv:2505.21549, 2025

2025
[12]

Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,

Y. Yu, C. Cao, Y. Zhang, Q. Lv, L. Min, and Y. Zhang, “Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,” in Proc. AAAI Conf. Artif. Intell., vol. 39, no. 9, 2025, pp. 9689–9697

2025
[13]

Leveraging temporal contextualization for video action recognition,

M. Kim, D. Han, T. Kim, and B. Han, “Leveraging temporal contextualization for video action recognition,” in Proc. Eur. Conf. Comput. Vis., 2024, pp. 74–91

2024
[14]

Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization,

Z. Weng, X. Yang, A. Li, Z. Wu, and Y.-G. Jiang, “Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 36978–36989

2023
[16]

Learning to prompt for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, 2022

2022
[17]

Conditional prompt learning for vision-language models,

K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16816–16825

2022
[18]

Visual-language prompt tuning with knowledge-guided context optimization,

H. Yao, R. Zhang, and C. Xu, “Visual-language prompt tuning with knowledge-guided context optimization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 6757–6767

2023
[19]

Self-regulating prompts: Foundational model adaptation without forgetting,

M. U. Khattak, S. T. Wasim, M. Naseer, S. Khan, M.-H. Yang, and F. S. Khan, “Self-regulating prompts: Foundational model adaptation without forgetting,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 15190–15200

2023
[20]

Maple: Multi-modal prompt learning,

M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “Maple: Multi-modal prompt learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19113–19122

2023
[21]

Mmrl: Multi-modal representation learning for vision-language models,

Y. Guo and X. Gu, “Mmrl: Multi-modal representation learning for vision-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2025, pp. 25015–25025

2025
[22]

The Power of Scale for Parameter-Efficient Prompt Tuning

B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” arXiv:2104.08691, 2021

2021
[23]

Tip-adapter: Training-free adaption of clip for few-shot classification,

R. Zhang, W. Zhang, R. Fang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, “Tip-adapter: Training-free adaption of clip for few-shot classification,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 493–510

2022
[24]

Clip-adapter: Better vision-language models with feature adapters,

P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, “Clip-adapter: Better vision-language models with feature adapters,” Int. J. Comput. Vis., vol. 132, no. 2, pp. 581–595, 2024

2024
[25]

Mma: Multi-modal adapter for vision-language models,

L. Yang, R.-Y. Zhang, Y. Wang, and X. Xie, “Mma: Multi-modal adapter for vision-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 23826–23837

2024
[26]

Scaling & shifting your features: A new baseline for efficient model tuning,

D. Lian, D. Zhou, J. Feng, and X. Wang, “Scaling & shifting your features: A new baseline for efficient model tuning,” in Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 109–123

2022
[27]

Multi-modal interactive agent layer for few-shot universal cross-domain retrieval and beyond,

K. Chen, P. Fang, and H. Xue, “Multi-modal interactive agent layer for few-shot universal cross-domain retrieval and beyond,” in Adv. Neural Inf. Process. Syst., vol. 38, 2025

2025
[28]

Vilt: Vision-and-language transformer without convolution or region supervision,

W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 5583–5594

2021
[29]

Align before fuse: Vision and language representation learning with momentum distillation,

J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Adv. Neural Inf. Process. Syst., vol. 34, pp. 9694–9705, 2021

2021
[30]

Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,

H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 32897–32912, 2022

2022
[31]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 12888–12900

2022
[32]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Adv. Neural Inf. Process. Syst., vol. 36, pp. 34892–34916, 2023

2023
[33]

Parameter-efficient transfer learning for nlp,

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 2790–2799

2019
[34]

Lora: Low-rank adaptation of large language models,

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in Proc. Int. Conf. Learn. Representations, 2022

2022
[35]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models,

E. B. Zaken, Y. Goldberg, and S. Ravfogel, “Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models,” in Proc. Annu. Meet. Assoc. Comput. Linguist., 2022, pp. 1–9

2022
[36]

Learning with enriched inductive biases for vision-language models,

L. Yang, R.-Y. Zhang, Q. Chen, and X. Xie, “Learning with enriched inductive biases for vision-language models,” Int. J. Comput. Vis., vol. 133, no. 6, pp. 3746–3761, 2025

2025
[37]

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement,

X. Zhu, R. Zhang, B. He, A. Zhou, D. Wang, B. Zhao, and P. Gao, “Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 2605–2615

2023
[38]

Task residual for tuning vision-language models,

T. Yu, Z. Lu, X. Jin, Z. Chen, and X. Wang, “Task residual for tuning vision-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 10899–10909

2023
[39]

Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners,

R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, and H. Li, “Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 15211–15222

2023
[40]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5998–6008

2017
[41]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255

2009
[42]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,

L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshop, 2004, pp. 178–178

2004
[43]

Cats and dogs,

O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2012, pp. 3498–3505

2012
[44]

3d object representations for fine-grained categorization,

J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2013, pp. 554–561

2013
[45]

Automated flower classification over a large number of classes,

M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proc. Indian Conf. Comput. Vis. Graph. Image Process., 2008, pp. 722–729

2008
[46]

Food-101–mining discriminative components with random forests,

L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 446–461

2014
[47]

Fine-Grained Visual Classification of Aircraft

S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv:1306.5151, 2013

2013
[48]

Sun database: Large-scale scene recognition from abbey to zoo,

J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3485–3492

2010
[49]

A dataset of 101 human action classes from videos in the wild,

K. Soomro, A. R. Zamir, and M. Shah, “A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, no. 11, pp. 1–7, 2012

2012
[50]

Describing textures in the wild,

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3606–3613

2014
[51]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,

P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 12, no. 7, pp. 2217–2226, 2019

2019
[52]

Do imagenet classifiers generalize to imagenet?

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do imagenet classifiers generalize to imagenet?” in Proc. Int. Conf. Mach. Learn., 2019, pp. 5389–5400

2019
[53]

Learning robust global representations by penalizing local predictive power,

H. Wang, S. Ge, Z. Lipton, and E. P. Xing, “Learning robust global representations by penalizing local predictive power,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 10506–10518, 2019

2019
[54]

Natural adversarial examples,

D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song, “Natural adversarial examples,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 15262–15271

2021
[55]

The many faces of robustness: A critical analysis of out-of-distribution generalization,

D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 8340–8349

2021
[56]

Moment matching for multi-source domain adaptation,

X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang, “Moment matching for multi-source domain adaptation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1406–1415

2019
[57]

The sketchy database: learning to retrieve badly drawn bunnies,

P. Sangkloy, N. Burnell, C. Ham, and J. Hays, “The sketchy database: learning to retrieve badly drawn bunnies,” ACM Trans. Graph., vol. 35, no. 4, pp. 1–12, 2016

2016
[58]

Deep sketch hashing: Fast free-hand sketch-based image retrieval,

L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao, “Deep sketch hashing: Fast free-hand sketch-based image retrieval,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2862–2871

2017
[59]

How do humans sketch objects?

M. Eitz, J. Hays, and M. Alexa, “How do humans sketch objects?” ACM Trans. Graph., vol. 31, no. 4, pp. 1–10, 2012

2012
[60]

Sketchnet: Sketch classification with web images,

H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao, “Sketchnet: Sketch classification with web images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1105–1113

2016
[61]

Tcp: Textual-based class-aware prompt tuning for visual-language model,

H. Yao, R. Zhang, and C. Xu, “Tcp: Textual-based class-aware prompt tuning for visual-language model,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 23438–23448

2024
[62]

Divergence-enhanced knowledge-guided context optimization for visual-language prompt tuning,

Y. Li, M. Cheng, X. Han, and W. Song, “Divergence-enhanced knowledge-guided context optimization for visual-language prompt tuning,” in Proc. Int. Conf. Learn. Representations, 2025

2025
[63]

Bi-modality individual-aware prompt tuning for visual-language model,

H. Yao, R. Zhang, H. Lyu, Y. Zhang, and C. Xu, “Bi-modality individual-aware prompt tuning for visual-language model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 8, pp. 6352–6368, 2025

2025
[64]

Frequency-based comprehensive prompt learning for vision-language models,

L. Liu, N. Wang, C. Chen, D. Liu, X. Yang, X. Gao, and T. Liu, “Frequency-based comprehensive prompt learning for vision-language models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 12, pp. 11974–11989, 2025

2025
[65]

Hierarchical cross-modal prompt learning for vision-language models,

H. Zheng, S. Yang, Z. He, J. Yang, and Z. Huang, “Hierarchical cross-modal prompt learning for vision-language models,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2025, pp. 1891–1901

2025
[66]

Promptkd: Unsupervised prompt distillation for vision-language models,

Z. Li, X. Li, X. Fu, X. Zhang, W. Wang, S. Chen, and J. Yang, “Promptkd: Unsupervised prompt distillation for vision-language models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 26 617–26 626

2024
[67]

Adaptformer: Adapting vision transformers for scalable visual recognition,

S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, “Adaptformer: Adapting vision transformers for scalable visual recognition,” in Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 16 664–16 678

2022
[68]

Visual prompt tuning,

M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, “Visual prompt tuning,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 709–727.

Kaixiang Chen received the BSc degree from Zhejiang University of Technology, Hangzhou, China, in 2021, the MEng degree from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in ...

2022
