pith. machine review for the scientific record.

arxiv: 2605.09976 · v1 · submitted 2026-05-11 · 💻 cs.CV


OZ-TAL: Online Zero-Shot Temporal Action Localization

Chaolei Han, Hongsong Wang, Jie Gui, Xin Gong

Pith reviewed 2026-05-12 02:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords online zero-shot temporal action localization · vision-language models · training-free framework · temporal action localization · zero-shot learning · streaming video · THUMOS14 · ActivityNet

The pith

A training-free framework using off-the-shelf vision-language models detects previously unseen actions in streaming videos online.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called Online Zero-Shot Temporal Action Localization that requires spotting actions never encountered in training, and doing so in real time as video arrives. It offers a method that needs no task-specific training or fine-tuning, instead relying on existing vision-language models plus extra steps to sharpen their visual processing and reduce biases. Benchmarks created on THUMOS14 and ActivityNet-1.3 show the approach beats prior methods in both offline and online zero-shot conditions. This setup matters because most current action detectors are locked to the actions they saw during training and cannot handle live streams of novel content.
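The abstract pins down neither the VLM nor the extra mechanisms, so the following is only a minimal sketch of the generic training-free recipe it describes, assuming a frozen CLIP model as a stand-in for the paper's unspecified VLM; the prompt template, class names, and function names are illustrative assumptions, not the authors' method.

```python
# Minimal training-free zero-shot scoring sketch: a frozen CLIP-style VLM
# (an assumed stand-in; the paper's actual VLM and extra mechanisms are
# unspecified) scores each incoming frame against text embeddings of
# candidate action classes, with no task-specific training.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # frozen, no fine-tuning

# Illustrative unseen action classes and a common prompt template (assumptions).
classes = ["high jump", "cliff diving", "pole vault"]
tokens = clip.tokenize([f"a video frame of a person doing {c}" for c in classes]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)  # unit-normalize once

def score_frame(frame: Image.Image) -> torch.Tensor:
    """Per-class probabilities for a single streaming frame (no lookahead)."""
    image = preprocess(frame).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return (100.0 * img_feat @ text_feat.T).softmax(dim=-1).squeeze(0)
```

Everything the paper adds on top of this baseline recipe (feature enhancement, bias mitigation, instance grouping) is exactly what the review below presses on.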

Core claim

We introduce the OZ-TAL task for detecting previously unseen actions in an online fashion from untrimmed streaming videos. We propose a training-free framework that leverages off-the-shelf Vision-Language Models while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases, establishing new benchmarks on THUMOS14 and ActivityNet-1.3 where it substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

What carries the argument

The training-free framework that integrates off-the-shelf vision-language models with additional mechanisms for enhancing visual representations and mitigating biases.

Load-bearing premise

Off-the-shelf vision-language models plus unspecified additional mechanisms can reliably improve visual representations, reduce biases, and detect unseen actions in a true online streaming setting without any task-specific training.

What would settle it

A live streaming video containing actions outside the vision-language model's training distribution, on which the method's temporal localization accuracy falls below that of a simple baseline that ignores visual input.

Figures

Figures reproduced from arXiv: 2605.09976 by Chaolei Han, Hongsong Wang, Jie Gui, Xin Gong.

Figure 1
Figure 1: Comparison between traditional On-TAL models and our method. (a) Traditional models are constrained to recognizing only seen actions from training data, whereas (b) our approach leverages VLMs to detect arbitrary unseen actions with enhanced generalization. view at source ↗
Figure 2
Figure 2: Illustration of OZ-TAL. The task involves predicting the start time, end time, and category of each action under two constraints: (1) the action categories in the training and test sets are completely disjoint, and (2) future frame information and post-processing are not available during inference. The caption's notation denotes the n-th action's category cn ∈ C and its associated confidence score pn (a formalization follows this figure list). view at source ↗
Figure 3
Figure 3: Overview of the VLM-Based Feature-Enhanced Action Localizer (VFEAL). The framework comprises four sequential components: (1) Feature Extraction with Prompting: short-term visual features and text features are extracted using the video and text encoders of a VLM; (2) Memory-Guided Feature Enhancement: long-term dependencies are modeled by enhancing current features with salient historical context; (3) Backg… (a sketch of the memory-guided step follows this figure list). view at source ↗
Figure 4
Figure 4: Analysis of hyperparameters: (a) memory bank length; (b) classification threshold. The thumbnail also carries Table V (analysis of different classification strategies; mAP at tIoU thresholds 0.3–0.7):

Row  Text            BG  0.3    0.4   0.5   0.6   0.7   Avg
1    fixed           ✗   12.96  8.49  5.13  2.64  1.24  6.09
2    class-specific  ✗   13.3   8.76  5.77  3.21  1.89  6.59
3    K + 1           ✓   14.05  9.43  5.96  3.44  1.87  6.95
4    BAKC            ✓   17.69  11.6  7.04  3.88  1.86  8.41

view at source ↗
Figure 5
Figure 5: Illustration of false positives: (a) five sources of false positive errors, where G denotes the total number of ground truth instances; and (b) their impact on average mAP improvement. The thumbnail also carries Table VI (analysis of different LLMs for class descriptions; mAP at tIoU thresholds 0.3–0.7):

Row  Descriptions      0.3   0.4  0.5  0.6  0.7  Avg
1    fixed             13.0  8.5  5.1  2.6  1.2  6.1
2    GPT-4o [64]       13.8  9.1  6.0  3.2  2.1  6.8
3    DeepSeek-R1 [60]  13.3  8.8  5.8  3.2  1.9  6.6

view at source ↗
Figure 6
Figure 6: Illustration of class-specific descriptions generated by an LLM. The prompt provided to the LLM includes two generation requirements and one output format constraint. view at source ↗
Figure 7
Figure 7: Sensitivity analysis of VFEAL. (a) Sensitivity of VFEAL's average mAP to action characteristics. (b) The sensitivity profile summarizing the left figure. The difference between the max and min average-mAPN represents the sensitivity, while the difference between the max and the overall average-mAPN denotes the impact of the characteristic. Length buckets: XS (<30s), S (30–60s), M (60–120s), L (120–180s), XL (>180s); instance-count buckets: XS (1), S (2–4), M (5–8), L (>8) same-class occurrences per video. view at source ↗
Figure 8
Figure 8: False negative analysis. Average false negative rate of VFEAL across three action characteristics on the THUMOS14 dataset. view at source ↗
Figure 9
Figure 9: Visualization of TAL results. Comparison of temporal localization outputs between Baseline-II and our method on two video examples from the THUMOS14 dataset. view at source ↗
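
Two captions above carry enough detail to make concrete. From Figure 2's notation, a plausible formalization of the OZ-TAL output; the start/end-time symbols s_n and e_n are assumptions, while c_n and p_n appear in the caption itself:

```latex
% Plausible OZ-TAL output (s_n, e_n are assumed symbol names; c_n and p_n
% are the category and confidence from the Figure 2 caption).
\[
  \Psi = \{(s_n,\; e_n,\; c_n,\; p_n)\}_{n=1}^{N},
  \qquad c_n \in \mathcal{C}_{\mathrm{test}},
  \qquad \mathcal{C}_{\mathrm{train}} \cap \mathcal{C}_{\mathrm{test}} = \varnothing
\]
% Online constraint: a prediction emitted at stream time t may depend only
% on frames observed up to t, with no post-processing over the full video.
```

And from Figure 3, the Memory-Guided Feature Enhancement step reads, as far as the truncated caption allows, as attention from the current short-term feature over a bounded bank of salient historical features. A minimal sketch under that reading; the class name, bank policy, and attention form are assumptions, not the authors' implementation:

```python
# Hedged sketch of Figure 3, step (2): enhance the current short-term
# feature with salient historical context held in a bounded memory bank.
from collections import deque

import torch

class MemoryGuidedEnhancer:
    """Illustrative memory-guided enhancement (all names are assumptions)."""

    def __init__(self, max_len: int = 64):
        self.bank = deque(maxlen=max_len)  # oldest features evicted first

    def __call__(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (1, D) short-term visual feature from the VLM video encoder.
        if self.bank:
            mem = torch.stack(list(self.bank))          # (M, D) history
            attn = torch.softmax(feat @ mem.T, dim=-1)  # (1, M) relevance weights
            feat = feat + attn @ mem                    # residual enhancement
        self.bank.append(feat.squeeze(0).detach())      # store for future frames
        return feat
```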
read the original abstract

Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces a new task called Online Zero-Shot Temporal Action Localization (OZ-TAL) for detecting previously unseen actions in streaming videos. It proposes a training-free framework that uses off-the-shelf Vision-Language Models (VLMs) augmented with additional mechanisms to enhance visual representations and mitigate biases. New benchmarks and baselines are established on THUMOS14 and ActivityNet-1.3, with claims that the method substantially outperforms existing state-of-the-art approaches in both offline and online zero-shot settings.

Significance. If the training-free framework with the additional mechanisms can be shown to reliably handle unseen actions in a true online streaming setting, this would be a significant contribution to open-world temporal action localization by reducing reliance on task-specific training data and improving generalization. The introduction of the OZ-TAL task and associated benchmarks is a clear positive, as is the emphasis on leveraging existing VLMs to avoid training costs.

major comments (2)
  1. Abstract: The central claim that the method 'substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings' is asserted without any quantitative metrics, tables, or specific results, which is load-bearing for evaluating the performance contribution.
  2. Methods (framework description): The 'additional mechanisms' to enhance visual representations and mitigate VLM biases are referenced but not detailed with algorithms, equations, pseudocode, or implementation specifics, preventing verification that the approach is strictly training-free and operates without lookahead in a streaming regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: The central claim that the method 'substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings' is asserted without any quantitative metrics, tables, or specific results, which is load-bearing for evaluating the performance contribution.

    Authors: The abstract is a concise summary; the full quantitative results, including mAP improvements over baselines on THUMOS14 and ActivityNet-1.3, appear in Tables 1-4 and the associated figures in the Experiments section. We will revise the abstract to incorporate a small number of key performance metrics to make the claim more self-contained. revision: yes

  2. Referee: Methods (framework description): The 'additional mechanisms' to enhance visual representations and mitigate VLM biases are referenced but not detailed with algorithms, equations, pseudocode, or implementation specifics, preventing verification that the approach is strictly training-free and operates without lookahead in a streaming regime.

    Authors: Section 3 details the visual enhancement and bias-mitigation components, which operate on frozen off-the-shelf VLMs with no task-specific training or parameter updates. The online pipeline processes the video stream frame-by-frame without access to future frames. We will add explicit pseudocode, equations, and implementation notes in the revision to facilitate verification. revision: yes
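
To make the "no lookahead" constraint concrete: per-frame scores are grouped into instances in a single causal pass, and each instance is emitted the moment its class run ends, never revised retroactively. A minimal sketch that composes with the frame scorer sketched earlier; the threshold and grouping rule are assumptions, not the paper's pipeline:

```python
# Causal instance grouping sketch: one pass, no future frames, no
# post-processing. score_frame is the per-frame scorer sketched above.
def stream_localize(frames, score_frame, classes, thresh=0.5):
    """Yield (start_frame, end_frame, class_name, confidence) online."""
    cls, start, scores = None, None, []
    for t, frame in enumerate(frames):     # frames arrive one at a time
        probs = score_frame(frame)
        c = int(probs.argmax())
        p = float(probs[c])
        hit = p >= thresh
        # Close the open instance when its class run ends or score drops.
        if cls is not None and (not hit or c != cls):
            yield (start, t, classes[cls], sum(scores) / len(scores))
            cls, scores = None, []
        if hit:
            if cls is None:
                cls, start = c, t
            scores.append(p)
    if cls is not None:                    # flush instance open at stream end
        yield (start, len(frames), classes[cls], sum(scores) / len(scores))
```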

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces the OZ-TAL task and proposes a training-free framework that leverages existing off-the-shelf VLMs plus unspecified additional mechanisms for bias mitigation. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text. Claims of outperformance rest on empirical benchmarks rather than any closed-loop mathematical reduction to the inputs themselves. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the approach rests on the unstated assumption that off-the-shelf VLMs plus unspecified mechanisms suffice for the new task.

pith-pipeline@v0.9.0 · 5478 in / 1151 out tokens · 73867 ms · 2026-05-12T02:43:48.306884+00:00 · methodology



Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 7 internal anchors

  1. [1]

    Multi-granularity contrastive cross-modal collaborative generation for end-to-end long-term video question answering

    T. Yu, K. Fu, J. Zhang, Q. Huang, and J. Yu, “Multi-granularity contrastive cross-modal collaborative generation for end-to-end long-term video question answering,” IEEE Transactions on Image Processing, vol. 33, pp. 3115–3129, 2024.

  2. [2]

    Video moment retrieval with cross-modal neural architecture search

    X. Yang, S. Wang, J. Dong, J. Dong, M. Wang, and T.-S. Chua, “Video moment retrieval with cross-modal neural architecture search,” IEEE Transactions on Image Processing, vol. 31, pp. 1204–1216, 2022.

  3. [3]

    Neighbor-guided pseudo-label generation and refinement for single-frame supervised temporal action localization

    G. Li, D. Cheng, N. Wang, J. Li, and X. Gao, “Neighbor-guided pseudo-label generation and refinement for single-frame supervised temporal action localization,” IEEE Transactions on Image Processing, vol. 33, pp. 2419–2430, 2024.

  4. [4]

    Adaptive prototype learning for weakly-supervised temporal action localization

    W. Luo, H. Ren, T. Zhang, W. Yang, and Y. Zhang, “Adaptive prototype learning for weakly-supervised temporal action localization,” IEEE Transactions on Image Processing, vol. 34, pp. 3154–3168, 2024.

  5. [5]

    Hypergraph multi-modal large language model: Exploiting EEG and eye-tracking modalities to evaluate heterogeneous responses for video understanding

    M. Wu, C. Zhao, A. Su, D. Di, T. Fu, D. An, M. He, Y. Gao, M. Ma, K. Yan et al., “Hypergraph multi-modal large language model: Exploiting EEG and eye-tracking modalities to evaluate heterogeneous responses for video understanding,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7316–7325.

  6. [6]

    Efficient dual-confounding eliminating for weakly-supervised temporal action localization

    A. Li, H. Liu, J. Sheng, Z. Chen, and Y. Ge, “Efficient dual-confounding eliminating for weakly-supervised temporal action localization,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 8179–8188.

  7. [7]

    Online temporal action localization with memory-augmented transformer

    Y. Song, D. Kim, M. Cho, and S. Kwak, “Online temporal action localization with memory-augmented transformer,” in European Conference on Computer Vision. Springer, 2024, pp. 74–91.

  8. [8]

    HAT: History-augmented anchor transformer for online temporal action localization

    S. Reza, Y. Zhang, M. Moghaddam, and O. Camps, “HAT: History-augmented anchor transformer for online temporal action localization,” in European Conference on Computer Vision. Springer, 2024, pp. 205–222.

  9. [9]

    ActionSwitch: Class-agnostic detection of simultaneous actions in streaming videos

    H. Kang, J. Hyun, J. An, Y. Yu, and S. J. Kim, “ActionSwitch: Class-agnostic detection of simultaneous actions in streaming videos,” in European Conference on Computer Vision. Springer, 2024, pp. 383–400.

  10. [10]

    A temporal-aware relation and attention network for temporal action localization

    Y. Zhao, H. Zhang, Z. Gao, W. Guan, J. Nie, A. Liu, M. Wang, and S. Chen, “A temporal-aware relation and attention network for temporal action localization,” IEEE Transactions on Image Processing, vol. 31, pp. 4746–4760, 2022.

  11. [11]

    FineAction: A fine-grained video dataset for temporal action localization

    Y. Liu, L. Wang, Y. Wang, X. Ma, and Y. Qiao, “FineAction: A fine-grained video dataset for temporal action localization,” IEEE Transactions on Image Processing, vol. 31, pp. 6937–6950, 2022.

  12. [12]

    Learnable feature augmentation framework for temporal action localization

    Y. Tang, W. Wang, C. Zhang, J. Liu, and Y. Zhao, “Learnable feature augmentation framework for temporal action localization,” IEEE Transactions on Image Processing, vol. 33, pp. 4002–4015, 2024.

  13. [13]

    Constructing semantical structure by segmentation integrated video embedding for temporal action detection

    Z. Zhao, S. Liu, C. Zhao, and X. Zhao, “Constructing semantical structure by segmentation integrated video embedding for temporal action detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2025.

  14. [14]

    BRTAL: Boundary refinement temporal action localization via offset-driven diffusion models

    H. Liu, X. Li, B. Fan, and J. Xu, “BRTAL: Boundary refinement temporal action localization via offset-driven diffusion models,” IEEE Transactions on Circuits and Systems for Video Technology, 2025.

  15. [15]

    Temporal action localization in the deep learning era: A survey

    B. Wang, Y. Zhao, L. Yang, T. Long, and X. Li, “Temporal action localization in the deep learning era: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 4, pp. 2171–2190, 2023.

  16. [16]

    Context-enhanced memory-refined transformer for online action detection

    Z. Pang, F. Sener, and A. Yao, “Context-enhanced memory-refined transformer for online action detection,” arXiv preprint arXiv:2503.18359, 2025.

  17. [17]

    Does video-text pretraining help open-vocabulary online action detection?

    Y. Wang, J. Xu, Y. He, Z. Song, L. Wang, Y. Qiao, C. Zhao et al., “Does video-text pretraining help open-vocabulary online action detection?” Advances in Neural Information Processing Systems, vol. 37, pp. 47908–47930, 2024.

  18. [18]

    A sliding window scheme for online temporal action localization

    Y. H. Kim, H. Kang, and S. J. Kim, “A sliding window scheme for online temporal action localization,” in European Conference on Computer Vision. Springer, 2022, pp. 653–669.

  19. [19]

    Zero-shot temporal action detection via vision-language prompting

    S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang, “Zero-shot temporal action detection via vision-language prompting,” in European Conference on Computer Vision. Springer, 2022, pp. 681–697.

  20. [20]

    Prompting visual-language models for efficient video understanding

    C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, “Prompting visual-language models for efficient video understanding,” in European Conference on Computer Vision. Springer, 2022, pp. 105–124.

  21. [21]

    Zero-shot temporal action detection by learning multimodal prompts and text-enhanced actionness

    A. Raza, B. Yang, and Y. Zou, “Zero-shot temporal action detection by learning multimodal prompts and text-enhanced actionness,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.

  22. [22]

    Temporal action detection model compression by progressive block drop

    X. Chen, Y. Guo, J. Liang, S. Zhuang, R. Zeng, and X. Hu, “Temporal action detection model compression by progressive block drop,” arXiv preprint arXiv:2503.16916, 2025.

  23. [23]

    Boundary denoising for video activity localization

    M. Xu, M. Soldan, J. Gao, S. Liu, J. Pérez-Rúa, and B. Ghanem, “Boundary denoising for video activity localization,” in The Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, May 7–11, 2024.

  24. [24]

    Benchmarking the robustness of temporal action detection models against temporal corruptions

    R. Zeng, X. Chen, J. Liang, H. Wu, G. Cao, and Y. Guo, “Benchmarking the robustness of temporal action detection models against temporal corruptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18263–18274.

  25. [25]

    UniMD: Towards unifying moment retrieval and temporal action detection

    Y. Zeng, Y. Zhong, C. Feng, and L. Ma, “UniMD: Towards unifying moment retrieval and temporal action detection,” in European Conference on Computer Vision. Springer, 2024, pp. 286–304.

  26. [26]

    Adapting short-term transformers for action detection in untrimmed videos

    M. Yang, H. Gao, P. Guo, and L. Wang, “Adapting short-term transformers for action detection in untrimmed videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18570–18579.

  27. [27]

    End-to-end temporal action detection with 1B parameters across 1000 frames

    S. Liu, C.-L. Zhang, C. Zhao, and B. Ghanem, “End-to-end temporal action detection with 1B parameters across 1000 frames,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18591–18601.

  28. [28]

    Dual DETRs for multi-label temporal action detection

    Y. Zhu, G. Zhang, J. Tan, G. Wu, and L. Wang, “Dual DETRs for multi-label temporal action detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18559–18569.

  29. [29]

    DyFADet: Dynamic feature aggregation for temporal action detection

    L. Yang, Z. Zheng, Y. Han, H. Cheng, S. Song, G. Huang, and F. Li, “DyFADet: Dynamic feature aggregation for temporal action detection,” in European Conference on Computer Vision. Springer, 2024, pp. 305–322.

  30. [30]

    2PESNet: Towards online processing of temporal action localization

    Y. H. Kim, S. Nam, and S. J. Kim, “2PESNet: Towards online processing of temporal action localization,” Pattern Recognition, vol. 131, p. 108871, 2022.

  31. [31]

    E2E-LOAD: End-to-end long-form online action detection

    S. Cao, W. Luo, B. Wang, W. Zhang, and L. Ma, “E2E-LOAD: End-to-end long-form online action detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10422–10432.

  32. [32]

    MiniROAD: Minimal RNN framework for online action detection

    J. An, H. Kang, S. H. Han, M.-H. Yang, and S. J. Kim, “MiniROAD: Minimal RNN framework for online action detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10341–10350.

  33. [33]

    Colar: Effective and efficient online action detection by consulting exemplars

    L. Yang, J. Han, and D. Zhang, “Colar: Effective and efficient online action detection by consulting exemplars,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3160–3169.

  34. [34]

    CAG-QIL: Context-aware actionness grouping via Q imitation learning for online temporal action localization

    H. Kang, K. Kim, Y. Ko, and S. J. Kim, “CAG-QIL: Context-aware actionness grouping via Q imitation learning for online temporal action localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13729–13738.

  35. [35]

    SimOn: A simple framework for online temporal action localization

    T. N. Tang, J. Park, K. Kim, and K. Sohn, “SimOn: A simple framework for online temporal action localization,” arXiv preprint arXiv:2211.04905, 2022.

  36. [36]

    ZSTAD: Zero-shot temporal activity detection

    L. Zhang, X. Chang, J. Liu, M. Luo, S. Wang, Z. Ge, and A. Hauptmann, “ZSTAD: Zero-shot temporal activity detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 879–888.

  37. [37]

    Efficient estimation of word representations in vector space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

  38. [38]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.

  39. [39]

    TN-ZSTAD: Transferable network for zero-shot temporal activity detection

    L. Zhang, X. Chang, J. Liu, M. Luo, Z. Li, L. Yao, and A. Hauptmann, “TN-ZSTAD: Transferable network for zero-shot temporal activity detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3848–3861, 2022.

  40. [40]

    Test-time zero-shot temporal action localization

    B. Liberatori, A. Conti, P. Rota, Y. Wang, and E. Ricci, “Test-time zero-shot temporal action localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18720–18729.

  41. [41]

    CoCa: Contrastive captioners are image-text foundation models

    J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “CoCa: Contrastive captioners are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022.

  42. [42]

    PaLM: Scaling language modeling with pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.

  43. [43]

    LLaMA: Open and efficient foundation language models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.

  44. [44]

    MiniGPT-4: Enhancing vision-language understanding with advanced large language models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.

  45. [45]

    GPT-4 technical report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

  46. [46]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12888–12900.

  47. [47]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742.

  48. [48]

    Improved baselines with visual instruction tuning

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26296–26306.

  49. [49]

    ShareGPT4Video: Improving video understanding and generation with better captions

    L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, Z. Tang, L. Yuan et al., “ShareGPT4Video: Improving video understanding and generation with better captions,” Advances in Neural Information Processing Systems, vol. 37, pp. 19472–19495, 2024.

  50. [50]

    VideoChat: Chat-centric video understanding

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “VideoChat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023.

  51. [51]

    DeTAL: Open-vocabulary temporal action localization with decoupled networks

    Z. Li, Y. Zhong, R. Song, T. Li, L. Ma, and W. Zhang, “DeTAL: Open-vocabulary temporal action localization with decoupled networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  52. [52]

    Text-infused attention and foreground-aware modeling for zero-shot temporal action detection

    Y. Lee, H.-J. Kim, and S.-W. Lee, “Text-infused attention and foreground-aware modeling for zero-shot temporal action detection,” Advances in Neural Information Processing Systems, vol. 37, pp. 9864–9884, 2024.

  53. [53]

    InternVid: A large-scale video-text dataset for multimodal understanding and generation

    Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, Y. Wang et al., “InternVid: A large-scale video-text dataset for multimodal understanding and generation,” arXiv preprint arXiv:2307.06942, 2023.

  54. [54]

    Vita-CLIP: Video and text adaptive CLIP via multimodal prompting

    S. T. Wasim, M. Naseer, S. Khan, F. S. Khan, and M. Shah, “Vita-CLIP: Video and text adaptive CLIP via multimodal prompting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23034–23044.

  55. [55]

    An information compensation framework for zero-shot skeleton-based action recognition

    H. Xu, Y. Gao, J. Li, and X. Gao, “An information compensation framework for zero-shot skeleton-based action recognition,” IEEE Transactions on Multimedia, 2025.

  56. [56]

    End-to-end spatio-temporal action localisation with video transformers

    A. A. Gritsenko, X. Xiong, J. Djolonga, M. Dehghani, C. Sun, M. Lucic, C. Schmid, and A. Arnab, “End-to-end spatio-temporal action localisation with video transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18373–18383.

  57. [57]

    Semantics guided contrastive learning of transformers for zero-shot temporal activity detection

    S. Nag, O. Goldstein, and A. K. Roy-Chowdhury, “Semantics guided contrastive learning of transformers for zero-shot temporal activity detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 6243–6253.

  58. [58]

    The THUMOS challenge on action recognition for videos “in the wild”

    H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah, “The THUMOS challenge on action recognition for videos ‘in the wild’,” Computer Vision and Image Understanding, vol. 155, pp. 1–23, 2017.

  59. [59]

    ActivityNet: A large-scale video benchmark for human activity understanding

    F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.

  60. [60]

    DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.

  61. [61]

    Diagnosing error in temporal action detectors

    H. Alwassel, F. C. Heilbron, V. Escorcia, and B. Ghanem, “Diagnosing error in temporal action detectors,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 256–272.

  62. [62]

    A review of generalized zero-shot learning methods

    F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, and Q. J. Wu, “A review of generalized zero-shot learning methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4051–4070, 2022.

  63. [63]

    Towards open vocabulary learning: A survey

    J. Wu, X. Li, S. Xu, H. Yuan, H. Ding, Y. Yang, X. Li, J. Zhang, Y. Tong, X. Jiang et al., “Towards open vocabulary learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 5092–5113, 2024.

  64. [64]

    GPT-4o system card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.