pith. machine review for the scientific record.

arxiv: 2604.17062 · v1 · submitted 2026-04-18 · 💻 cs.CV

Recognition: unknown

Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot action recognition · CLIP adaptation · motion separation · negative prompts · semantic alignment · video understanding · disentangled features

The pith

Separating motion features from static content and aligning videos with negative text prompts lets CLIP recognize actions never seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that CLIP can be made more effective for zero-shot video action recognition by first disentangling motion-sensitive features from global static ones, then refining the motion signal through gated attention, and finally aligning the resulting video embeddings to both positive class prompts and negative non-class prompts. A reader would care if true because this would let recognition systems identify new actions in videos without collecting labeled examples for every possible category, addressing the core limitation that prevents current models from working on evolving or rare activities. The approach relies on two new modules to keep motion information clean and uses negative prompts to explicitly teach the model what the video is not, which the authors show produces stronger results than earlier CLIP adaptations on both coarse and fine-grained benchmarks.
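To make the pipeline concrete, here is a minimal PyTorch sketch of the inference path described above. It is a reading aid under stated assumptions: the msm and mab callables stand in for the paper's MSM and MAB, the prompt templates are invented, and scoring a class as positive-prompt similarity minus negative-prompt similarity is one plausible way to use the negative prompts, not the authors' confirmed objective.

```python
# Hypothetical sketch of the zero-shot inference path; msm, mab, and the
# prompt templates are stand-ins, not the authors' released implementation.
import torch
import torch.nn.functional as F

def zero_shot_scores(frame_feats, class_names, text_encoder, msm, mab, tau=0.07):
    """frame_feats: (T, D) per-frame features from a frozen CLIP visual encoder."""
    motion, static = msm(frame_feats)        # disentangle motion-sensitive vs. global-static
    motion = mab(motion, static)             # gated cross-attention refinement of motion
    video = F.normalize(motion.mean(0) + static.mean(0), dim=-1)  # (D,) fused embedding

    pos = F.normalize(text_encoder([f"a video of {c}" for c in class_names]), dim=-1)
    neg = F.normalize(text_encoder([f"not {c}" for c in class_names]), dim=-1)

    # Reward similarity to each class prompt, penalize similarity to its negation.
    logits = (video @ pos.T - video @ neg.T) / tau
    return logits.softmax(dim=-1)            # (num_classes,) class probabilities
```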

Core claim

We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model non-class semantics.

What carries the argument

The Motion Separation Module (MSM) that isolates motion-sensitive features, the Motion Aggregation Block (MAB) that performs gated cross-attention on motion, and the dual use of positive and negative textual prompts to enforce semantic alignment.
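The abstract names gated cross-attention but gives no equations. As a reading aid, here is a minimal PyTorch sketch of a generic gated cross-attention block consistent with the MAB's description; the sigmoid gate over concatenated features and the residual placement are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Generic gated cross-attention in the spirit of the MAB: the motion stream
    queries the static stream, and a learned sigmoid gate decides, per channel,
    how much attended context to mix back in, so static redundancy is not
    simply re-coupled into the motion path."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion, static):
        # motion: (B, T, D) queries; static: (B, T, D) keys/values
        attended, _ = self.attn(motion, static, static)
        g = self.gate(torch.cat([motion, attended], dim=-1))  # gate values in (0, 1)
        return self.norm(motion + g * attended)               # gated residual update
```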

If this is right

  • The framework produces consistent gains over prior CLIP-based zero-shot methods on standard coarse and fine-grained action benchmarks.
  • Negative prompts allow the model to represent what an action is not, supporting better transfer to classes absent from training.
  • Gated cross-attention in the aggregation block keeps motion features clean without reintroducing static redundancy.
  • The same alignment strategy works across both broad category sets and detailed action distinctions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The motion-disentanglement idea could transfer to other video-text tasks such as dense captioning or temporal localization.
  • Negative prompts might reduce overconfident predictions on ambiguous or out-of-distribution video clips.
  • Independent ablations of the separation and aggregation steps would clarify which part drives most of the reported improvement.
  • The approach points toward lightweight adaptation techniques that avoid full model retraining when new action classes appear.

Load-bearing premise

That isolating motion features and aligning them with negative prompts will reliably shrink the semantic gap to unseen actions without creating new errors on real video distributions.

What would settle it

A run on a fresh fine-grained action dataset in which the method fails to beat the strongest prior CLIP zero-shot baseline would falsify the central claim.

read the original abstract

Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a framework for zero-shot video action recognition that augments CLIP with a Motion Separation Module (MSM) to disentangle motion-sensitive from global-static features and a Motion Aggregation Block (MAB) that uses gated cross-attention to refine motion representations. Semantic alignment is enforced by projecting video embeddings to match positive textual prompts while using negative prompts to explicitly capture non-class semantics, with the goal of closing the semantic gap for unseen action classes. The central claim is that this yields consistent outperformance over prior CLIP-based methods on standard benchmarks for both coarse- and fine-grained datasets.

Significance. If the empirical results hold after proper validation, the work offers a concrete mechanism for improving generalization in zero-shot settings by combining motion disentanglement with explicit negative-prompt modeling of non-class semantics. This could be useful for fine-grained actions where motion overlap is high. The approach is grounded in existing CLIP architectures and does not appear to introduce new free parameters beyond standard training, which is a positive attribute. However, the absence of any quantitative metrics, baselines, or ablations in the abstract limits immediate assessment of impact.

major comments (2)
  1. [Abstract] The claim that the method 'consistently outperforms prior CLIP-based approaches' is presented without any accuracy numbers, dataset names, splits, error bars, or comparison tables. This is load-bearing for the central empirical claim; without these elements the soundness of the outperformance assertion cannot be evaluated.
  2. [Methods] Negative prompt construction: The procedure for generating negative prompts for truly unseen classes is not specified (e.g., fixed 'not [class]' templates, vocabulary drawn from seen classes, or learned). This detail is critical because reliance on training-class priors would violate standard zero-shot protocols and could produce non-generalizable gains, especially on fine-grained datasets where motion overlap makes 'non-class' semantics ambiguous.
minor comments (1)
  1. [Abstract] The phrase 'disentangled embeddings and semantic-guided interaction' is used without a one-sentence pointer to the MSM/MAB modules or the alignment objective; a brief clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the changes we will make in the revision.

read point-by-point responses
  1. Referee: [Abstract] The claim that the method 'consistently outperforms prior CLIP-based approaches' is presented without any accuracy numbers, dataset names, splits, error bars, or comparison tables. This is load-bearing for the central empirical claim; without these elements the soundness of the outperformance assertion cannot be evaluated.

    Authors: We agree that the abstract would benefit from greater specificity to support the central empirical claim. In the revised manuscript, we will update the abstract to explicitly name the standard benchmarks (UCF101, HMDB51, and a fine-grained dataset such as Something-Something V2) and the zero-shot splits used, and to point directly to the quantitative comparisons and tables in the experiments section. This will allow immediate evaluation of the outperformance claim while respecting abstract length constraints. revision: yes

  2. Referee: [Methods] Negative prompt construction: The procedure for generating negative prompts for truly unseen classes is not specified (e.g., fixed 'not [class]' templates, vocabulary drawn from seen classes, or learned). This detail is critical because reliance on training-class priors would violate standard zero-shot protocols and could produce non-generalizable gains, especially on fine-grained datasets where motion overlap makes 'non-class' semantics ambiguous.

    Authors: We thank the referee for this important clarification request. The negative prompts are constructed via the fixed template 'not [class]' using the name of the target (unseen) action class at evaluation time. This is standard in zero-shot settings where test class names are provided for prompt construction, and no vocabulary or information from the training classes is used. No learned components are involved. We will add an explicit paragraph in the Methods section describing this construction process, including an example, to confirm adherence to zero-shot protocols. revision: yes
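For concreteness, a sketch of the template-based construction the rebuttal describes. Only the 'not [class]' negative template comes from the rebuttal; the positive template shown is a typical CLIP-style assumption, since the abstract does not give one.

```python
# Negative prompts via the fixed "not [class]" template, applied only to the
# test-time class names; the positive template is an assumed example.
def build_prompts(test_class_names):
    positives = [f"a video of a person {c}" for c in test_class_names]
    negatives = [f"not {c}" for c in test_class_names]
    return positives, negatives

pos, neg = build_prompts(["archery", "juggling balls"])
# pos -> ['a video of a person archery', 'a video of a person juggling balls']
# neg -> ['not archery', 'not juggling balls']
```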

Circularity Check

0 steps flagged

No circularity: empirical method validated on benchmarks without self-referential derivations.

full rationale

The paper proposes architectural components (MSM for motion separation, MAB for gated cross-attention, and positive/negative prompt alignment) to address the semantic gap in zero-shot action recognition. All load-bearing claims reduce to experimental outperformance on standard coarse and fine-grained datasets rather than any first-principles derivation, fitted-parameter prediction, or self-citation chain. No equations appear in the abstract, and the described framework introduces new modules whose effectiveness is measured externally against prior CLIP baselines; nothing reduces to its own inputs by construction. This is a standard empirical contribution whose validity rests on benchmark results, not internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted beyond the high-level module names.

pith-pipeline@v0.9.0 · 5423 in / 1129 out tokens · 30937 ms · 2026-05-10T06:37:27.332989+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    INTRODUCTION In recent years, large-scale vision–language pre-trained models such as CLIP [1] have shown remarkable success in cross-modal learning, driving significant advances in zero-shot learning (ZSL) [2]. Extending this paradigm to the video domain, zero-shot action recognition (ZSAR) [3] seeks to classify unseen actions by transferring knowledge...

  2. [2]

    Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition

    PROPOSED METHOD 2.1. CLIP-based Video-Text Representation As shown in Fig. 1, given an input video F_v ∈ R^{T×C×H×W} with T frames, we first extract visual representations using a frozen CLIP visual encoder. To adapt CLIP to the video domain while preserving zero-shot generalization, we introduce a lightweight Dual Adapter (DA) [17], which is shared across both...

  3. [3]

    All results are reported in accuracy (%).

    Method            Publication   HMDB-51     UCF-101     K-600
    Methods with Vision Training:
    ER-ZSAR [3]       ICCV’21       35.3±4.6    51.8±2.9    42.1±1.4
    JigSawNet [4]     TIP’19        39.3±3.9    56.8±2.8    —
    Methods with Vision-Language Training:
    A5 [5]            ECCV’22       44.3±2.2    69.3±4.2    —
    X-CLIP [20]       ECCV’22       46.3±0.6    70.3±2.3    67.1±1.0
    Vita-CLIP [7]     CVPR’23       48.6±0.6    75.0±0...

  4. [4]

    Accuracy in %

    HM = harmonic mean of Base and Novel. Accuracy in %.

                              Kinetics-400            HMDB-51
    Method                    Base   Novel  HM        Base   Novel  HM
    Vanilla CLIP B/16 [1]     53.3   46.8   49.8      53.3   46.8   49.8
    ActionCLIP B/16 [6]       69.0   57.2   62.6      69.1   37.3   48.5
    XCLIP B/16 [20]           74.1   56.4   64.0      69.4   45.5   55.0
    A5 [5]                    74.1   56.4   64.0      46.2   16.0   23.8
    ViFi-CLIP B/16 [21]       76.4   61.1   67.9      73.8   53.3   61.9
    ZAR B/16 [11]             75...

  5. [5]

    EXPERIMENTS 3.1. Experimental Setup To evaluate our method, we conduct experiments on five widely used benchmarks: Kinetics-400 [22], Kinetics-600 [23], HMDB51 [24], UCF101 [25], and Something-Something V2 (SSv2) [26]. Kinetics-400 serves as the training set, while HMDB51, UCF101, SSv2, and Kinetics-600 (excluding overlaps with K400) are used for zero-shot...

  6. [6]

    These designs jointly enhance the model’s ability to generalize from base to novel classes, providing a principled step toward zero-shot video action recognition

    CONCLUSION In conclusion, our motion-guided framework effectively disentangles motion and global cues, integrates them into semantically aligned representations, and leverages negative prompts for robust learning. These designs jointly enhance the model’s ability to generalize from base to novel classes, providing a principled step toward zero-shot video...

  7. [7]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  8. [8]

    Zero-shot learning with semantic output codes

    Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell, “Zero-shot learning with semantic output codes,” Advances in Neural Information Processing Systems, vol. 22, 2009

  9. [9]

    Elaborative rehearsal for zero-shot action recognition

    Shizhe Chen and Dong Huang, “Elaborative rehearsal for zero-shot action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13638–13647

  10. [10]

    Jigsawnet: Shredded image reassembly using convolutional neural network and loop-based composition

    Canyu Le and Xin Li, “Jigsawnet: Shredded image reassembly using convolutional neural network and loop-based composition,” IEEE Transactions on Image Processing, vol. 28, no. 8, pp. 4000–4015, 2019

  11. [11]

    Prompting visual-language models for efficient video understanding

    Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie, “Prompting visual-language models for efficient video understanding,” in European Conference on Computer Vision. Springer, 2022, pp. 105–124

  12. [12]

    Actionclip: Adapting language-image pre-trained models for video action recognition

    Mengmeng Wang, Jiazheng Xing, Jianbiao Mei, Yong Liu, and Yunliang Jiang, “Actionclip: Adapting language-image pre-trained models for video action recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 1, pp. 625–637, 2023

  13. [13]

    Vita-clip: Video and text adaptive clip via multimodal prompting

    Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah, “Vita-clip: Video and text adaptive clip via multimodal prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23034–23044

  14. [14]

    Ez-clip: Efficient zeroshot video action recognition

    Shahzad Ahmad, Sukalpa Chanda, and Yogesh S Rawat, “Ez-clip: Efficient zeroshot video action recognition,” arXiv preprint arXiv:2312.08010, 2023

  15. [15]

    Is temporal prompting all we need for limited labeled action recognition?

    Shreyank Gowda, Boyan Gao, Xiao Gu, and Xiabo Jin, “Is temporal prompting all we need for limited labeled action recognition?,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 682–692

  16. [16]

    Kronecker mask and interpretive prompts are language-action video learners

    Yang JingYi, Zitong YU, Nixiuming, He Jia, and Hui Li, “Kronecker mask and interpretive prompts are language-action video learners,” in The Thirteenth International Conference on Learning Representations, 2025

  17. [17]

    Zar: Zero-shot action recognition with dynamic prompt tuning

    Qiyue Liang, Cheng Lu, Chun Tao, and Jan P Allebach, “Zar: Zero-shot action recognition with dynamic prompt tuning,” Electronic Imaging, vol. 37, pp. 1–10, 2025

  18. [18]

    Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge

    Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, and Horst Bischof, “Match, expand and improve: Unsupervised finetuning for zero-shot action recognition with language knowledge,” in ICCV, 2023

  19. [19]

    Telling stories for common sense zero-shot action recognition

    Shreyank N Gowda and Laura Sevilla-Lara, “Telling stories for common sense zero-shot action recognition,” in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 4577–4594

  20. [20]

    Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip

    Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, and Yanning Zhang, “Building a multi-modal spatiotemporal expert for zero-shot action recognition with clip,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 9689–9697

  21. [21]

    FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition

    Xiaohu Huang, Hao Zhou, Kun Yao, and Kai Han, “FROSTER: Frozen CLIP is a strong teacher for open-vocabulary action recognition,” in The Twelfth International Conference on Learning Representations, 2024

  22. [22]

    Continual learning improves zero-shot action recognition

    Shreyank N Gowda, Davide Moltisanti, and Laura Sevilla-Lara, “Continual learning improves zero-shot action recognition,” in Proceedings of the Asian Conference on Computer Vision, 2024, pp. 3239–3256

  23. [23]

    Adapterhub: A framework for adapting transformers

    Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych, “Adapterhub: A framework for adapting transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 46–54

  24. [24]

    Understanding the impact of negative prompts: When and how do they take effect?

    Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, and Cho-Jui Hsieh, “Understanding the impact of negative prompts: When and how do they take effect?,” in European Conference on Computer Vision. Springer, 2024, pp. 190–206

  25. [25]

    Deep adaptive wavelet network

    Maria Ximena Bastidas Rodriguez, Adrien Gruson, Luisa Polania, Shin Fujieda, Flavio Prieto, Kohei Takayama, and Toshiya Hachisuka, “Deep adaptive wavelet network,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3111–3119

  26. [26]

    Expanding language-image pretrained models for general video recognition

    Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling, “Expanding language-image pretrained models for general video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18

  27. [27]

    Fine-tuned clip models are efficient video learners

    Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan, “Fine-tuned clip models are efficient video learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6545–6554

  28. [28]

    The kinetics human action video dataset

    Andrew Zisserman, Joao Carreira, Karen Simonyan, Will Kay, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017

  29. [29]

    A Short Note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman, “A short note about kinetics-600,” arXiv preprint arXiv:1808.01340, 2018

  30. [30]

    Hmdb: a large video database for human motion recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre, “Hmdb: a large video database for human motion recognition,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563

  31. [31]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012

  32. [32]

    The ‘something something’ video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al., “The ‘something something’ video database for learning and evaluating visual common sense,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850