pith. machine review for the scientific record.

arxiv: 2605.01165 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition

David Menotti, Helio Pedrini, Rayson Laroca, Valter Estevam


Pith reviewed 2026-05-09 18:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot action recognition · contrastive learning · joint video-text embedding · automatic negative sampling · UCF-101 · Kinetics-400 · multimodal alignment

The pith

A contrastive method aligns videos with text descriptions in a joint space to recognize actions never seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes training a model to place video clips and their matching sentences close together in one embedding space while separating mismatched pairs. Automatic negative sampling creates those mismatched examples on the fly, so the system learns without extra manual labels. The joint space is meant to close the semantic gap between visual and language features and to cope with domain shift, the fact that test actions differ from training ones. If the alignment holds, a model could label a new action simply by comparing its video embedding to the embedding of a text description.
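As an illustration of that training signal (a minimal sketch, not the authors' released code; the encoders, feature dimensions, and margin below are placeholder assumptions), one contrastive step over (video, positive sentence, negative sentence) triplets could look like:

```python
import torch
import torch.nn.functional as F

# Placeholder heads standing in for the paper's Visual Embedding and
# Sentence Embedding modules; any video backbone / sentence model
# projected to a shared dimension would slot in here.
video_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(768, 512)

def triplet_step(video_feats, pos_text_feats, neg_text_feats, margin=0.2):
    """Pull the matching video-sentence pair together and push the
    mismatched pair at least `margin` apart in cosine distance."""
    v = F.normalize(video_encoder(video_feats), dim=-1)
    tp = F.normalize(text_encoder(pos_text_feats), dim=-1)
    tn = F.normalize(text_encoder(neg_text_feats), dim=-1)
    pos_dist = 1.0 - (v * tp).sum(-1)  # distance to the matching sentence
    neg_dist = 1.0 - (v * tn).sum(-1)  # distance to the mismatched sentence
    return F.relu(pos_dist - neg_dist + margin).mean()

# Toy batch of 8 precomputed video and sentence features.
loss = triplet_step(torch.randn(8, 2048), torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```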

Core claim

The central claim is that a contrastive model encoding both videos and sentences into a shared embedding space, trained by pulling matching video-description pairs together and pushing apart automatically generated unpaired examples, produces representations that generalize to action classes absent from the training set, reaching state-of-the-art accuracy on the UCF-101 and Kinetics-400 datasets across multiple zero-shot splits.

What carries the argument

The joint video-text embedding space trained with contrastive loss and automatic negative sampling to produce unpaired visual and textual examples.
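Figure 2 describes the negative sampling as hard and free of human supervision. One plausible realization, assumed here to be in-batch mining rather than whatever procedure §3.2 actually specifies, picks for each video the most similar non-matching description:

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(video_emb, text_emb):
    """For each video, select the most similar text embedding that is
    NOT its own description (in-batch hard negative mining; one
    plausible reading of 'automatic negative sampling')."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                          # (B, B) cosine similarities
    sim.fill_diagonal_(float("-inf"))      # exclude the true pairs
    neg_idx = sim.argmax(dim=1)            # hardest mismatched sentence
    return text_emb[neg_idx]

negatives = mine_hard_negatives(torch.randn(8, 512), torch.randn(8, 512))
```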

If this is right

  • State-of-the-art zero-shot accuracy on UCF-101 and Kinetics-400 under several training-test splits.
  • Action classification at test time using only a textual description of the target class (see the sketch after this list).
  • Reduced dependence on collecting labeled video examples for every possible action class.
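A minimal sketch of that test-time step, assuming class descriptions are embedded once and each video takes the label of its nearest description in the joint space (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_emb, class_text_embs):
    """Label each video with the unseen class whose description
    embedding is nearest (highest cosine similarity) in the joint space."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(class_text_embs, dim=-1)
    return (v @ c.T).argmax(dim=1)  # index of best-matching class description

# 16 test videos vs. 10 unseen-class descriptions, all already embedded.
preds = zero_shot_classify(torch.randn(16, 512), torch.randn(10, 512))
```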

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same contrastive alignment procedure could be reused for zero-shot recognition of objects or events once suitable text descriptions exist.
  • Automatic negative sampling may cut the cost of curating large vision-language training sets for other multimodal tasks.
  • Applying the method to datasets with larger visual or linguistic differences would test how far the joint space generalizes.

Load-bearing premise

The alignment learned from seen actions and their descriptions will still match unseen actions to their textual descriptions despite differences in appearance and wording.

What would settle it

A controlled test on a fresh zero-shot split of a standard action dataset in which removing the automatic negative sampling step causes accuracy to drop below prior non-contrastive baselines.
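Concretely, the comparison could run two variants over the same fresh split, one trained with mined negatives and one with random negatives; the harness below is hypothetical scaffolding, not the paper's protocol:

```python
import torch
import torch.nn.functional as F

def zero_shot_accuracy(video_embs, labels, class_text_embs):
    """Fraction of videos whose nearest class description is correct."""
    v = F.normalize(video_embs, dim=-1)
    c = F.normalize(class_text_embs, dim=-1)
    preds = (v @ c.T).argmax(dim=1)
    return (preds == labels).float().mean().item()

# Hypothetical harness: same fresh split, same class descriptions,
# one model trained with mined negatives and one with random negatives.
# acc_full  = zero_shot_accuracy(emb_full(videos),  labels, cls_emb)
# acc_ablat = zero_shot_accuracy(emb_ablat(videos), labels, cls_emb)
# The test above is settled if acc_ablat drops below the best
# non-contrastive baseline while acc_full stays above it.
```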

Figures

Figures reproduced from arXiv: 2605.01165 by David Menotti, Helio Pedrini, Rayson Laroca, Valter Estevam.

Figure 1
Figure 1. t-SNE visualization for a subset with the classes Horse Riding (blue), Horse Race (orange), Pommel Horse (green), and Balance Beam (red). Dots are videos, and stars are label prototypes. view at source ↗
Figure 2
Figure 2. Our method is composed of the Visual Embedding and Sentence Embedding modules. Each module produces a dense representation that is expected to be close if the sentence describes the video and far otherwise. For training the model, we propose a hard negative sampling method. This method seeks negative alignments between videos and texts without human supervision. Thus, we can generate triplets (video, posi… view at source ↗
Figure 3
Figure 3. (a) ZSARCAP [12] results encoded with SBERT; (b) CEZSAR (VE + O) view at source ↗
Original abstract

This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (i.e., CNNs, RNNs, transformer-based). This multimodal nature implies that the semantic properties of the two spaces are not identical. On the other hand, the domain shift arises from differences between the training and test sets and is inherent to ZSAR once the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearance and unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes CEZSAR, a contrastive embedding method for zero-shot action recognition (ZSAR). Videos and natural-language descriptions are encoded into a shared embedding space and aligned via contrastive learning; an automatic negative sampling procedure generates mismatched video-description pairs to augment training. The approach targets the semantic gap between visual and textual modalities as well as domain shift between seen and unseen action classes. The authors report state-of-the-art results on UCF-101 and Kinetics-400 under multiple train/test splits and release the code.

Significance. If the reported gains hold under rigorous evaluation, the work supplies a straightforward, reproducible contrastive baseline for ZSAR that avoids elaborate architectures while directly addressing modality alignment and negative sampling. Public code release strengthens the contribution by enabling direct verification and extension.

minor comments (3)
  1. §4 (Experiments): the exact train/test splits, number of runs, and full baseline comparisons (including recent contrastive and generative ZSAR methods) should be tabulated with mean and standard deviation to support the SOTA claim.
  2. §3.2 (Negative sampling): clarify whether the automatic procedure can inadvertently sample from classes that appear in the test set under any of the reported splits; a short ablation would strengthen the domain-shift argument.
  3. Figure 2 and §3.1: the joint embedding diagram would benefit from explicit notation for the video and text encoders and the temperature parameter used in the contrastive loss (a generic formulation is sketched after this list).
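For reference, the symmetric temperature-scaled contrastive objective that comment 3 asks to see notated is conventionally written as follows; whether CEZSAR uses this exact form or a triplet margin loss cannot be determined from the abstract alone:

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss with temperature tau: each video should
    score its own description highest among all texts in the batch, and
    vice versa. tau=0.07 is a conventional default, not the paper's value."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (v @ t.T) / tau            # (B, B) scaled similarities
    targets = torch.arange(v.size(0))   # diagonal entries are matching pairs
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```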

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of our straightforward contrastive baseline, and the recommendation for minor revision. We are grateful for the emphasis on reproducibility and code release.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents a contrastive embedding method for zero-shot action recognition that trains a joint video-text space by aligning videos with their natural-language descriptions and using automatic negative sampling to generate unpaired data. This follows standard contrastive learning objectives without any self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claims to tautologies. The SOTA empirical results on UCF-101 and Kinetics-400 under multiple splits are independent evaluations outside the training construction, with no equations or steps that reduce by construction to the inputs. The approach is self-contained as a direct application of multimodal alignment to address semantic gap and domain shift.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

An abstract-only review surfaces no explicit free parameters, axioms, or invented entities; the approach implicitly relies on standard contrastive-learning assumptions about embedding alignment.

pith-pipeline@v0.9.0 · 5532 in / 1021 out tokens · 31615 ms · 2026-05-09T18:46:59.186184+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1] Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: British Machine Vision Conference (BMVC). pp. 1–11 (2016). https://doi.org/10.5244/C.30.119

  2. [2] Brattoli, B., Tighe, J., Zhdanov, F., Perona, P., Chalupka, K.: Rethinking zero-shot video classification: End-to-end training for realistic applications. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4613–4623 (Jun 2020). https://doi.org/10.1109/CVPR42600.2020.00467

  3. [3] Bretti, C., Mettes, P.: Zero-shot action recognition from diverse object-scene compositions. In: British Machine Vision Conference (BMVC). pp. 1–14 (Nov 2021)

  4. [4] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4724–4733 (Jul 2017). https://doi.org/10.1109/CVPR.2017.502

  5. [5] Chen, S., Huang, D.: Elaborative rehearsal for zero-shot action recognition. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13638–13647 (Oct 2021)

  6. [6] Chen, X., et al.: AnyDoor: Zero-shot object-level image customization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6593–6602 (2024). https://doi.org/10.1109/CVPR52733.2024.00630

  7. [7] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Vision and Pattern Recognition (CVPR). pp. 539–546 (2005). https://doi.org/10.1109/CVPR.2005.202

  8. [8] Doshi, K., et al.: A Multimodal Benchmark and Improved Architecture for Zero Shot Learning. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 2010–2019 (2024). https://doi.org/10.1109/WACV57701.2024.00202

  9. [9] Estevam, V., Laroca, R., Pedrini, H., Menotti, D.: Global semantic descriptors for zero-shot action recognition. IEEE Signal Processing Letters 29, 1843–1847 (2022). https://doi.org/10.1109/LSP.2022.3200605

  10. [10] Estevam, V., Laroca, R., Pedrini, H., Menotti, D.: Dense video captioning using unsupervised semantic information. Journal of Visual Communication and Image Representation 107, 104385 (2025). https://doi.org/10.1016/j.jvcir.2024.104385

  11. [11] Estevam, V., Pedrini, H., Menotti, D.: Zero-shot action recognition in videos: A survey. Neurocomputing 439, 159–175 (2021). https://doi.org/10.1016/j.neucom.2021.01.036

  12. [12] Estevam, V., et al.: Tell me what you see: A zero-shot action recognition method based on natural language descriptions. Multimedia Tools and Applications 83, 28147–28173 (2024). https://doi.org/10.1007/s11042-023-16566-5

  13. [13] Gowda, S.N.: Synthetic sample selection for generalized zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 58–67 (2023). https://doi.org/10.1109/CVPRW59228.2023.00011

  14. [14] Gowda, S.N., Moltisanti, D., Sevilla-Lara, L.: Continual learning improves zero-shot action recognition. In: Asian Conference on Computer Vision (ACCV). pp. 403–421 (2024). https://doi.org/10.1007/978-981-96-0908-6_23

  15. [15] Gowda, S.N., Sevilla-Lara, L., Keller, F., Rohrbach, M.: CLASTER: Clustering with reinforcement learning for zero-shot action recognition. In: European Conference on Computer Vision (ECCV). pp. 187–203 (2022)

  16. [16] Han, Z., Fu, Z., Chen, S., Yang, J.: Contrastive embedding for generalized zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2371–2381 (2021). https://doi.org/10.1109/CVPR46437.2021.00240

  17. [17] Huang, K., Miralles-Pechuán, L., Mckeever, S.: Combining text and image knowledge with GANs for zero-shot action recognition in videos. In: International Conference on Computer Vision Theory and Applications (VISAPP). pp. 623–631 (2022). https://doi.org/10.5220/0010903100003124

  18. [18] Huang, K., Miralles-Pechuán, L., Mckeever, S.: Enhancing zero-shot action recognition in videos by combining GANs with text and images. SN Computer Science 4(4), 375 (2023). https://doi.org/10.1007/s42979-023-01803-3

  19. [19] Kerrigan, A., Duarte, K., Rawat, Y., Shah, M.: Reformulating zero-shot action recognition for multi-label actions. In: International Conference on Neural Information Processing Systems (NeurIPS). vol. 34, pp. 25566–25577 (2021)

  20. [20] Kim, T.S., et al.: DASZL: Dynamic action signatures for zero-shot learning. In: AAAI Conference on Artificial Intelligence (2021)

  21. [21] Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: IEEE International Conference on Computer Vision (ICCV). pp. 706–715 (2017). https://doi.org/10.1109/ICCV.2017.83

  22. [22] Lee, J.C., Lee, D.G.: ESC-ZSAR: Expanded semantics from categories with cross-attention for zero-shot action recognition. Expert Systems with Applications 255, 124786 (2024). https://doi.org/10.1016/j.eswa.2024.124786

  23. [23] Li, X., Yang, X., Wei, K., Deng, C., Yang, M.: Siamese contrastive embedding network for compositional zero-shot learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9326–9335 (Jun 2022)

  24. [24] Lin, C.C., et al.: Cross-modal representation learning for zero-shot action recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19978–19988 (Jun 2022)

  25. [25] Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3337–3344 (2011). https://doi.org/10.1109/CVPR.2011.5995353

  26. [26] Ma, P., Lu, H., Yang, B., Ran, W.: GAN-MVAE: A discriminative latent feature generation framework for generalized zero-shot learning. Pattern Recognition Letters 155, 77–83 (2022). https://doi.org/10.1016/j.patrec.2022.02.002

  27. [27] Mettes, P., Thong, W., Snoek, C.: Object priors for classifying and localizing unseen actions. International Journal of Computer Vision 129, 1954–1971 (2021). https://doi.org/10.1007/s11263-021-01454-y

  28. [28] Mettes, P., Snoek, C.G.M.: Spatial-aware object embeddings for zero-shot localization and classification of actions. In: IEEE International Conference on Computer Vision (ICCV). pp. 4453–4462 (2017). https://doi.org/10.1109/ICCV.2017.476

  29. [29] Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). vol. 139, pp. 8748–8763 (2021)

  30. [30] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using siamese BERT-networks. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 3982–3992 (2019)

  31. [31] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 1–6 (2012)

  32. [32] Sun, S., et al.: CLIP as RNN: Segment countless visual concepts without training endeavor. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13171–13182 (2024)

  33. [33] Vaswani, A., et al.: Attention is all you need. In: International Conference on Neural Information Processing Systems (NeurIPS). pp. 6000–6010 (2017)

  34. [34] Wang, Q., Chen, K.: Zero-shot visual recognition via bidirectional latent embedding. International Journal of Computer Vision 124(3), 356–383 (2017). https://doi.org/10.1007/s11263-017-1027-5

  35. [35] Wu, W., et al.: Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6620–6630 (2023). https://doi.org/10.1109/CVPR52729.2023.00640

  36. [36] Xu, H., et al.: VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In: Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6787–6800 (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.544

  37. [37] Xue, Y., Whitecross, K., Mirzasoleiman, B.: Investigating why contrastive learning benefits robustness against label noise. In: International Conference on Machine Learning (ICML). vol. 162, pp. 24851–24871 (2022)