pith. machine review for the scientific record.

arxiv: 2605.01165 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition

David Menotti, Helio Pedrini, Rayson Laroca, Valter Estevam


Pith reviewed 2026-05-09 18:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot action recognition · contrastive learning · joint video-text embedding · automatic negative sampling · UCF-101 · Kinetics-400 · multimodal alignment

The pith

A contrastive method aligns videos with text descriptions in a joint space to recognize actions never seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a model to place video clips and their matching sentences close together in one embedding space while separating mismatched pairs. Automatic negative sampling creates those mismatched examples on the fly, so the system learns without extra manual labels. The joint space is meant to close the semantic gap between visual and language features and to cope with domain shift, the fact that test actions differ from training ones. If the alignment holds, a model could label a new action simply by comparing its video embedding to the embedding of a text description.
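As an illustration of that training signal (a minimal sketch, not the authors' released code; the encoders, feature dimensions, and margin below are placeholder assumptions), one contrastive step over (video, positive sentence, negative sentence) triplets could look like:

```python
import torch
import torch.nn.functional as F

# Placeholder heads standing in for the paper's Visual Embedding and
# Sentence Embedding modules; any video backbone / sentence model
# projected to a shared dimension would slot in here.
video_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(768, 512)

def triplet_step(video_feats, pos_text_feats, neg_text_feats, margin=0.2):
    """Pull the matching video-sentence pair together and push the
    mismatched pair at least `margin` apart in cosine distance."""
    v = F.normalize(video_encoder(video_feats), dim=-1)
    tp = F.normalize(text_encoder(pos_text_feats), dim=-1)
    tn = F.normalize(text_encoder(neg_text_feats), dim=-1)
    pos_dist = 1.0 - (v * tp).sum(-1)  # distance to the matching sentence
    neg_dist = 1.0 - (v * tn).sum(-1)  # distance to the mismatched sentence
    return F.relu(pos_dist - neg_dist + margin).mean()

# Toy batch of 8 precomputed video and sentence features.
loss = triplet_step(torch.randn(8, 2048), torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```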

Core claim

The central claim is that a contrastive model encoding both videos and sentences into a shared embedding space, trained by pulling matching video-description pairs together and pushing apart automatically generated unpaired examples, produces representations that generalize to action classes absent from the training set, reaching state-of-the-art accuracy on the UCF-101 and Kinetics-400 datasets across multiple zero-shot splits.

What carries the argument

The joint video-text embedding space trained with contrastive loss and automatic negative sampling to produce unpaired visual and textual examples.
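Figure 2 describes the negative sampling as hard and free of human supervision. One plausible realization, assumed here to be in-batch mining rather than whatever procedure §3.2 actually specifies, picks for each video the most similar non-matching description:

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(video_emb, text_emb):
    """For each video, select the most similar text embedding that is
    NOT its own description (in-batch hard negative mining; one
    plausible reading of 'automatic negative sampling')."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                          # (B, B) cosine similarities
    sim.fill_diagonal_(float("-inf"))      # exclude the true pairs
    neg_idx = sim.argmax(dim=1)            # hardest mismatched sentence
    return text_emb[neg_idx]

negatives = mine_hard_negatives(torch.randn(8, 512), torch.randn(8, 512))
```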

If this is right

  • State-of-the-art zero-shot accuracy on UCF-101 and Kinetics-400 under several training-test splits.
  • Action classification at test time using only a textual description of the target class (see the sketch after this list).
  • Reduced dependence on collecting labeled video examples for every possible action class.
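A minimal sketch of that test-time step, assuming class descriptions are embedded once and each video takes the label of its nearest description in the joint space (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_emb, class_text_embs):
    """Label each video with the unseen class whose description
    embedding is nearest (highest cosine similarity) in the joint space."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(class_text_embs, dim=-1)
    return (v @ c.T).argmax(dim=1)  # index of best-matching class description

# 16 test videos vs. 10 unseen-class descriptions, all already embedded.
preds = zero_shot_classify(torch.randn(16, 512), torch.randn(10, 512))
```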

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same contrastive alignment procedure could be reused for zero-shot recognition of objects or events once suitable text descriptions exist.
  • Automatic negative sampling may cut the cost of curating large vision-language training sets for other multimodal tasks.
  • Applying the method to datasets with larger visual or linguistic differences would test how far the joint space generalizes.

Load-bearing premise

The alignment learned from seen actions and their descriptions will still match unseen actions to their textual descriptions despite differences in appearance and wording.

What would settle it

A controlled test on a fresh zero-shot split of a standard action dataset in which removing the automatic negative sampling step causes accuracy to drop below prior non-contrastive baselines.
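Concretely, the comparison could run two variants over the same fresh split, one trained with mined negatives and one with random negatives; the harness below is hypothetical scaffolding, not the paper's protocol:

```python
import torch
import torch.nn.functional as F

def zero_shot_accuracy(video_embs, labels, class_text_embs):
    """Fraction of videos whose nearest class description is correct."""
    v = F.normalize(video_embs, dim=-1)
    c = F.normalize(class_text_embs, dim=-1)
    preds = (v @ c.T).argmax(dim=1)
    return (preds == labels).float().mean().item()

# Hypothetical harness: same fresh split, same class descriptions,
# one model trained with mined negatives and one with random negatives.
# acc_full  = zero_shot_accuracy(emb_full(videos),  labels, cls_emb)
# acc_ablat = zero_shot_accuracy(emb_ablat(videos), labels, cls_emb)
# The test above is settled if acc_ablat drops below the best
# non-contrastive baseline while acc_full stays above it.
```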

Figures

Figures reproduced from arXiv: 2605.01165 by David Menotti, Helio Pedrini, Rayson Laroca, Valter Estevam.

Figure 1
Figure 1. t-SNE visualization for a subset with the classes Horse Riding (blue), Horse Race (orange), Pommel Horse (green), and Balance Beam (red). Dots are videos, and stars are label prototypes. view at source ↗
Figure 2
Figure 2. Our method is composed of the Visual Embedding and Sentence Embedding modules. Each module produces a dense representation that is expected to be close if the sentence describes the video and far otherwise. For training the model, we propose a hard negative sampling method. This method seeks negative alignments between videos and texts without human supervision. Thus, we can generate triplets (video, posi… view at source ↗
Figure 3
Figure 3. (a) ZSARCAP [12] results encoded with SBERT; (b) CEZSAR (VE + O) view at source ↗
Original abstract

This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (i.e., CNNs, RNNs, transformer-based). This multimodal nature implies that the semantic properties of the two spaces are not identical. On the other hand, the domain shift arises from differences between the training and test sets and is inherent to ZSAR once the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearance and unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes CEZSAR, a contrastive embedding method for zero-shot action recognition (ZSAR). Videos and natural-language descriptions are encoded into a shared embedding space and aligned via contrastive learning; an automatic negative sampling procedure generates mismatched video-description pairs to augment training. The approach targets the semantic gap between visual and textual modalities as well as domain shift between seen and unseen action classes. The authors report state-of-the-art results on UCF-101 and Kinetics-400 under multiple train/test splits and release the code.

Significance. If the reported gains hold under rigorous evaluation, the work supplies a straightforward, reproducible contrastive baseline for ZSAR that avoids elaborate architectures while directly addressing modality alignment and negative sampling. Public code release strengthens the contribution by enabling direct verification and extension.

minor comments (3)
  1. §4 (Experiments): the exact train/test splits, number of runs, and full baseline comparisons (including recent contrastive and generative ZSAR methods) should be tabulated with mean and standard deviation to support the SOTA claim.
  2. §3.2 (Negative sampling): clarify whether the automatic procedure can inadvertently sample from classes that appear in the test set under any of the reported splits; a short ablation would strengthen the domain-shift argument.
  3. Figure 2 and §3.1: the joint embedding diagram would benefit from explicit notation for the video and text encoders and the temperature parameter used in the contrastive loss (a generic formulation is sketched after this list).
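For reference, the symmetric temperature-scaled contrastive objective that comment 3 asks to see notated is conventionally written as follows; whether CEZSAR uses this exact form or a triplet margin loss cannot be determined from the abstract alone:

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss with temperature tau: each video should
    score its own description highest among all texts in the batch, and
    vice versa. tau=0.07 is a conventional default, not the paper's value."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (v @ t.T) / tau            # (B, B) scaled similarities
    targets = torch.arange(v.size(0))   # diagonal entries are matching pairs
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```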

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of our straightforward contrastive baseline, and the recommendation for minor revision. We are grateful for the emphasis on reproducibility and code release.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents a contrastive embedding method for zero-shot action recognition that trains a joint video-text space by aligning videos with their natural-language descriptions and using automatic negative sampling to generate unpaired data. This follows standard contrastive learning objectives without any self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claims to tautologies. The SOTA empirical results on UCF-101 and Kinetics-400 under multiple splits are independent evaluations outside the training construction, with no equations or steps that reduce by construction to the inputs. The approach is self-contained as a direct application of multimodal alignment to address semantic gap and domain shift.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

An abstract-only review surfaces no explicit free parameters, axioms, or invented entities; the approach implicitly relies on standard contrastive-learning assumptions about embedding alignment.

pith-pipeline@v0.9.0 · 5532 in / 1021 out tokens · 31615 ms · 2026-05-09T18:46:59.186184+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: British Machine Vision Conference (BMVC). pp. 1–11 (2016). https://doi.org/10.5244/C.30.119

  2. [2] Brattoli, B., Tighe, J., Zhdanov, F., Perona, P., Chalupka, K.: Rethinking zero-shot video classification: End-to-end training for realistic applications. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4613–4623 (Jun 2020). https://doi.org/10.1109/CVPR42600.2020.00467

  3. [3] Bretti, C., Mettes, P.: Zero-shot action recognition from diverse object-scene compositions. In: British Machine Vision Conference (BMVC). pp. 1–14 (Nov 2021)

  4. [4] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4724–4733 (Jul 2017). https://doi.org/10.1109/CVPR.2017.502

  5. [5] Chen, S., Huang, D.: Elaborative rehearsal for zero-shot action recognition. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13638–13647 (Oct 2021)

  6. [6] Chen, X., et al.: AnyDoor: Zero-shot object-level image customization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6593–6602 (2024). https://doi.org/10.1109/CVPR52733.2024.00630

  7. [7] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Vision and Pattern Recognition (CVPR). pp. 539–546 (2005). https://doi.org/10.1109/CVPR.2005.202

  8. [8] Doshi, K., et al.: A Multimodal Benchmark and Improved Architecture for Zero Shot Learning. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 2010–2019 (2024). https://doi.org/10.1109/WACV57701.2024.00202

  9. [9] Estevam, V., Laroca, R., Pedrini, H., Menotti, D.: Global semantic descriptors for zero-shot action recognition. IEEE Signal Processing Letters 29, 1843–1847 (2022). https://doi.org/10.1109/LSP.2022.3200605

  10. [10] Estevam, V., Laroca, R., Pedrini, H., Menotti, D.: Dense video captioning using unsupervised semantic information. Journal of Visual Communication and Image Representation 107, 104385 (2025). https://doi.org/10.1016/j.jvcir.2024.104385

  11. [11] Estevam, V., Pedrini, H., Menotti, D.: Zero-shot action recognition in videos: A survey. Neurocomputing 439, 159–175 (2021). https://doi.org/10.1016/j.neucom.2021.01.036

  12. [12] Estevam, V., et al.: Tell me what you see: A zero-shot action recognition method based on natural language descriptions. Multimedia Tools and Applications 83, 28147–28173 (2024). https://doi.org/10.1007/s11042-023-16566-5

  13. [13] Gowda, S.N.: Synthetic sample selection for generalized zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 58–67 (2023). https://doi.org/10.1109/CVPRW59228.2023.00011

  14. [14] Gowda, S.N., Moltisanti, D., Sevilla-Lara, L.: Continual learning improves zero-shot action recognition. In: Asian Conference on Computer Vision (ACCV). pp. 403–421 (2024). https://doi.org/10.1007/978-981-96-0908-6_23

  15. [15] Gowda, S.N., Sevilla-Lara, L., Keller, F., Rohrbach, M.: CLASTER: Clustering with reinforcement learning for zero-shot action recognition. In: European Conference on Computer Vision (ECCV). pp. 187–203 (2022)

  16. [16] Han, Z., Fu, Z., Chen, S., Yang, J.: Contrastive embedding for generalized zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2371–2381 (2021). https://doi.org/10.1109/CVPR46437.2021.00240

  17. [17] Huang, K., Miralles-Pechuán, L., Mckeever, S.: Combining text and image knowledge with GANs for zero-shot action recognition in videos. In: International Conference on Computer Vision Theory and Applications (VISAPP). pp. 623–631 (2022). https://doi.org/10.5220/0010903100003124

  18. [18] Huang, K., Miralles-Pechuán, L., Mckeever, S.: Enhancing zero-shot action recognition in videos by combining GANs with text and images. SN Computer Science 4(4), 375 (2023). https://doi.org/10.1007/s42979-023-01803-3

  19. [19] Kerrigan, A., Duarte, K., Rawat, Y., Shah, M.: Reformulating zero-shot action recognition for multi-label actions. In: International Conference on Neural Information Processing Systems (NeurIPS). vol. 34, pp. 25566–25577 (2021)

  20. [20] Kim, T.S., et al.: DASZL: Dynamic action signatures for zero-shot learning. In: AAAI Conference on Artificial Intelligence (2021)

  21. [21] Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: IEEE International Conference on Computer Vision (ICCV). pp. 706–715 (2017). https://doi.org/10.1109/ICCV.2017.83

  22. [22] Lee, J.C., Lee, D.G.: ESC-ZSAR: Expanded semantics from categories with cross-attention for zero-shot action recognition. Expert Systems with Applications 255, 124786 (2024). https://doi.org/10.1016/j.eswa.2024.124786

  23. [23] Li, X., Yang, X., Wei, K., Deng, C., Yang, M.: Siamese contrastive embedding network for compositional zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9326–9335 (Jun 2022)

  24. [24] Lin, C.C., et al.: Cross-modal representation learning for zero-shot action recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19978–19988 (Jun 2022)

  25. [25] Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3337–3344 (2011). https://doi.org/10.1109/CVPR.2011.5995353

  26. [26] Ma, P., Lu, H., Yang, B., Ran, W.: GAN-MVAE: A discriminative latent feature generation framework for generalized zero-shot learning. Pattern Recognition Letters 155, 77–83 (2022). https://doi.org/10.1016/j.patrec.2022.02.002

  27. [27] Mettes, P., Thong, W., Snoek, C.: Object priors for classifying and localizing unseen actions. International Journal of Computer Vision 129, 1954–1971 (2021). https://doi.org/10.1007/s11263-021-01454-y

  28. [28] Mettes, P., Snoek, C.G.M.: Spatial-aware object embeddings for zero-shot localization and classification of actions. In: IEEE International Conference on Computer Vision (ICCV). pp. 4453–4462 (2017). https://doi.org/10.1109/ICCV.2017.476

  29. [29] Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). vol. 139, pp. 8748–8763 (2021)

  30. [30] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 3982–3992 (2019)

  31. [31] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 1–6 (2012)

  32. [32] Sun, S., et al.: CLIP as RNN: Segment countless visual concepts without training endeavor. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13171–13182 (2024)

  33. [33] Vaswani, A., et al.: Attention is all you need. In: International Conference on Neural Information Processing Systems (NeurIPS). pp. 6000–6010 (2017)

  34. [34] Wang, Q., Chen, K.: Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision 124(3), 356–383 (2017). https://doi.org/10.1007/s11263-017-1027-5

  35. [35] Wu, W., et al.: Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6620–6630 (2023). https://doi.org/10.1109/CVPR52729.2023.00640

  36. [36] Xu, H., et al.: VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6787–6800 (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.544

  37. [37] Xue, Y., Whitecross, K., Mirzasoleiman, B.: Investigating why contrastive learning benefits robustness against label noise. In: International Conference on Machine Learning (ICML). vol. 162, pp. 24851–24871 (2022)