pith. machine review for the scientific record.

arxiv: 2604.22595 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot action recognition · CLIP adaptation · visual prompts · parameter-efficient · video understanding · domain shift

The pith

Mask prompts reweight pixels to action-relevant areas while context prompts compress temporal features, allowing CLIP to perform few-shot action recognition effectively under visual challenges such as low light and egocentric views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes adapting CLIP for video action recognition in difficult visual conditions using efficient visual prompts rather than focusing only on temporal modeling. It introduces mask prompts that reweight pixels to emphasize important regions and context prompts that summarize frame features into a compact form for temporal understanding. This matters because real-world videos often suffer from poor lighting or unusual camera angles that break standard models, and few-shot learning is needed when labeled data is scarce. The approach keeps the number of new parameters small and independent of the underlying model size, making it practical for limited hardware.

Core claim

EV-CLIP adapts CLIP using two visual prompts: mask prompts that guide attention by reweighting pixels to action-relevant regions, and context prompts that perform lightweight temporal modeling by compressing frame-wise features into a compact representation, enabling strong performance in few-shot settings across diverse scenes and viewpoints.

What carries the argument

The pair of mask prompts for spatial reweighting and context prompts for temporal compression that together address spatial perception deficits in CLIP for video tasks.
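To make that machinery concrete, the sketch below shows how the two prompt types could act on a clip in PyTorch: the mask prompt reweights pixels before the frozen CLIP encoder sees them, and the context prompt compresses per-frame features into one clip-level vector. All module sizes, tensor shapes, and names here are illustrative assumptions, not the paper's implementation.

    # Hedged sketch: mask prompts reweight pixels, context prompts compress
    # per-frame features. Shapes, sizes, and names are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    B, T, H, W, D = 2, 8, 224, 224, 512   # batch, frames, height, width, CLIP feature dim (assumed)

    class MaskPrompt(nn.Module):
        """Predicts a low-resolution spatial weight map and reweights input pixels."""
        def __init__(self, grid=14):
            super().__init__()
            self.grid = grid
            self.head = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # tiny, backbone-agnostic
        def forward(self, frames):                                   # frames: [B*T, 3, H, W]
            m = self.head(F.adaptive_avg_pool2d(frames, self.grid))  # [B*T, 1, grid, grid]
            m = F.interpolate(m, size=frames.shape[-2:], mode="bilinear", align_corners=False)
            return frames * torch.sigmoid(m)                         # pixels reweighted toward salient regions

    class ContextPrompt(nn.Module):
        """Compresses frame-wise features into one compact clip-level vector."""
        def __init__(self, dim=D, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        def forward(self, frame_feats):                 # frame_feats: [B, T, D]
            return self.mlp(frame_feats.mean(dim=1))    # temporal pooling, then a small MLP -> [B, D]

    def clip_image_encoder(x):        # stand-in for the frozen CLIP image encoder (assumption)
        return torch.randn(x.shape[0], D)

    frames = torch.randn(B, T, 3, H, W)
    reweighted = MaskPrompt()(frames.flatten(0, 1))        # spatial focus on action-relevant regions
    feats = clip_image_encoder(reweighted).view(B, T, D)   # per-frame embeddings
    video_feat = ContextPrompt()(feats)                    # compact representation matched against text features
    print(video_feat.shape)                                # torch.Size([2, 512])

The point this sketch mirrors is that both heads touch only raw pixels or already-pooled features; whether the paper's actual modules are fully independent of the backbone width is exactly what the referee report below presses on.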

If this is right

  • EV-CLIP achieves higher overall accuracy than other parameter-efficient adaptation methods on five benchmark datasets with visual and semantic domain shifts.
  • Its added computational cost does not increase with larger backbone models.
  • The framework is suitable for deployment in resource-constrained real-world applications.
  • Analysis of domain shifts shows how visual factors like lighting and viewpoints affect recognition performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prompting strategies could extend to other vision-language models facing spatial degradation in videos.
  • Testing on more extreme visual challenges such as heavy occlusion or motion blur would further validate the prompts' robustness.
  • The independence from backbone scale suggests potential for scaling to very large models without proportional efficiency loss.

Load-bearing premise

That the mask prompts successfully reweight pixels to highlight action-relevant regions and the context prompts compress frame features without losing critical temporal information, even under severe visual degradations like low light or egocentric viewpoints.

What would settle it

A head-to-head comparison on the curated five benchmark datasets: if EV-CLIP fails to exceed the accuracy of competing parameter-efficient methods while keeping similar or lower parameter counts and inference cost, the central claim falls.

Figures

Figures reproduced from arXiv: 2604.22595 by Eun Yi Kim, Hyo Jin Jon, Longbin Jin.

Figure 1: (a) Main approach of EV-CLIP. Few-shot samples from the target domain are adapted via two visual prompts, pixel reweighting for spatial focus …
Figure 2: Analysis of domain shift factors and their relationship with the zero-shot performance of Vanilla CLIP across video datasets. Each subfigure illustrates …
Figure 3: Overview of the EV-CLIP framework for video action recognition (left) and a detailed component breakdown (right). EV-CLIP enhances CLIP’s …
Figure 4: Performance comparison across five benchmark datasets from two-shot to 16-shot settings using the ViT-B/16 backbone, with the average plot …
Figure 5: Training efficiency comparison in terms of trainable parameters versus performance on UCF101 and HMDB51 under eight-shot settings.
Figure 6: Impact of frame length on performance across all evaluation datasets.
Figure 7: Visualization of frames and attention heatmaps comparing Vanilla CLIP and EV-CLIP on two examples from HMDB51 and ARID evaluation sets.
Figure 8: Visualization of original and reweighted frames on HMDB51 and ARID validation sets.
read the original abstract

CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes EV-CLIP, an efficient visual prompting framework for adapting CLIP to few-shot video action recognition under visual challenges such as low-light environments and egocentric viewpoints. It introduces mask prompts to reweight pixels and guide attention to action-relevant regions, along with context prompts to compress frame-wise features for lightweight temporal modeling. The method is evaluated on five curated benchmark datasets analyzing domain shifts, with claims of outperforming existing parameter-efficient methods and maintaining efficiency independent of backbone scale.

Significance. If the performance and efficiency claims hold with proper verification, this could enable practical deployment of large CLIP models in resource-constrained real-world video scenarios involving visual degradations, by providing a prompting approach that avoids scaling costs with model size.

major comments (2)
  1. Abstract: The central claim that 'its efficiency remains independent of the backbone scale' lacks any described mechanism, equations, or experimental support. Standard visual prompt designs incur parameter counts scaling with hidden_dim and sequence length (e.g., prompt_length × hidden_dim), which increase for larger backbones such as ViT-L versus ViT-B; without an explicit non-scaling design (e.g., shared or frozen components), this undermines the efficiency independence assertion.
  2. Abstract: The assertion that 'EV-CLIP outperforms existing parameter-efficient methods in overall performance' is presented without quantitative metrics, ablation details, statistical tests, or error analysis. This is load-bearing for the primary experimental claim and requires concrete results tables and comparisons in the evaluation section to be verifiable.
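For scale, here is a back-of-envelope count of what a standard VPT-style deep visual prompt would add, under assumed configurations (prompt length 16; ViT-B/16 with width 768 and 12 layers; ViT-L/14 with width 1024 and 24 layers), illustrating the scaling the first comment describes:

    # Parameter count for standard learnable visual prompts: prompt_length x hidden_dim
    # per prompted layer. All numbers are illustrative assumptions, not the paper's.
    def prompt_params(prompt_len, hidden_dim, n_layers):
        return prompt_len * hidden_dim * n_layers

    vit_b = prompt_params(16, 768, 12)     # 147,456 extra parameters
    vit_l = prompt_params(16, 1024, 24)    # 393,216 extra parameters
    print(vit_b, vit_l, round(vit_l / vit_b, 2))   # the larger backbone costs ~2.67x more

If EV-CLIP's prompts took this form, their cost would grow with the backbone, which is why the report asks for an explicit non-scaling mechanism.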

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, clarifying the design rationale and committing to revisions that strengthen the abstract and supporting sections without altering the core contributions.

read point-by-point responses
  1. Referee: Abstract: The central claim that 'its efficiency remains independent of the backbone scale' lacks any described mechanism, equations, or experimental support. Standard visual prompt designs incur parameter counts scaling with hidden_dim and sequence length (e.g., prompt_length × hidden_dim), which increase for larger backbones such as ViT-L versus ViT-B; without an explicit non-scaling design (e.g., shared or frozen components), this undermines the efficiency independence assertion.

    Authors: The mask prompts are generated as a fixed-resolution spatial reweighting map applied directly to the input frames prior to the visual encoder, using a lightweight convolutional head whose parameters are independent of the backbone hidden dimension. The context prompts operate on aggregated frame features via a dimension-agnostic compression module (global pooling followed by a small fixed-size MLP) that produces a compact temporal representation without scaling parameters with model width or depth. We will insert the corresponding equations and a parameter-count comparison table (ViT-B vs. ViT-L) into Section 3 and update the abstract to reference this non-scaling property explicitly. revision: yes

  2. Referee: Abstract: The assertion that 'EV-CLIP outperforms existing parameter-efficient methods in overall performance' is presented without quantitative metrics, ablation details, statistical tests, or error analysis. This is load-bearing for the primary experimental claim and requires concrete results tables and comparisons in the evaluation section to be verifiable.

    Authors: The full manuscript already contains quantitative comparisons in Tables 1 and 2 across the five curated datasets, reporting mean top-1 accuracy and standard deviation over three random seeds for EV-CLIP versus CoOp, MaPLe, and other baselines under identical few-shot protocols. Ablation results appear in Section 4.3. We will revise the abstract to include specific average accuracy gains (e.g., +X% over the strongest baseline) and ensure all performance claims are cross-referenced to these tables; we did not conduct formal statistical significance tests beyond reporting standard deviations but can add them if the referee deems necessary. revision: partial
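A minimal sketch of the backbone-agnostic design described in the first response above, with parameter counts at two assumed feature widths; layer sizes and widths are guesses for illustration, not the paper's code:

    # Hedged sketch of the rebuttal's claimed non-scaling heads, with parameter
    # counts at two assumed feature widths. Sizes are illustrative, not from the paper.
    import torch.nn as nn

    def head_params(feature_dim, hidden=256):
        # Mask-prompt head: a small conv on raw pixels at a fixed resolution,
        # so its parameter count never sees the backbone width.
        mask_head = nn.Conv2d(3, 1, kernel_size=3, padding=1)
        # Context-prompt head: global pooling is parameter-free; only the small
        # MLP's input/output layers touch the backbone's feature dimension.
        context_head = nn.Sequential(nn.Linear(feature_dim, hidden), nn.GELU(),
                                     nn.Linear(hidden, feature_dim))
        count = lambda m: sum(p.numel() for p in m.parameters())
        return count(mask_head), count(context_head)

    print(head_params(512))   # assumed ViT-B/16 feature width -> (28, 262912)
    print(head_params(768))   # assumed ViT-L/14 feature width -> (28, 394240)

Even in this sketch, strict independence holds only for the pixel-space head; the promised ViT-B versus ViT-L parameter table is what would make the abstract's claim checkable.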

Circularity Check

0 steps flagged

No circularity: paper presents empirical method without derivations or predictions that reduce to inputs

full rationale

The manuscript describes EV-CLIP via two prompt types (mask prompts for spatial reweighting, context prompts for temporal compression) and reports experimental outperformance plus efficiency claims on five datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All central claims rest on empirical evaluation rather than any algebraic reduction to the method's own definitions or prior self-citations. The efficiency-independence statement is an empirical observation, not a derived result that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are explicitly declared in the abstract; the method relies on standard CLIP components plus the two new prompt types whose internal hyperparameters are not detailed here.

pith-pipeline@v0.9.0 · 5525 in / 1154 out tokens · 46020 ms · 2026-05-08T12:26:09.013157+00:00 · methodology

