Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection
Pith reviewed 2026-05-10 05:24 UTC · model grok-4.3
SIGIR'26, July 20–24, 2026, Melbourne, Australia
The pith
Diffusion models extract foreground knowledge from videos to align them with unseen action labels for temporal detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the DFAlign framework, following a conditioning-denoising-aligning sequence, uses diffusion to generate foreground knowledge that acts as an intermediate semantic anchor: the SUC module unifies action-shared and action-specific semantics into the condition, the BSD module suppresses background during denoising, and the FPA module injects the resulting foreground knowledge as prompt tokens, together improving cross-modal alignment and discriminability for unseen action categories.
What carries the argument
The diffusion-based foreground knowledge prompting process that progressively removes background redundancy from video features to create semantic anchors between video segments and text labels.
If this is right
- Semantic noise from mismatched labels and video content is reduced, yielding more precise localization of action segments.
- Discriminability of action-relevant parts increases, supporting accurate classification of categories absent from training data.
- Generative denoising becomes a reusable mechanism for background suppression in video-text tasks.
- The overall pipeline achieves state-of-the-art detection performance on the two OV-TAD benchmarks.
Where Pith is reading between the lines
- The same foreground-extraction step could be adapted to improve open-vocabulary localization in static images or audio clips.
- If the anchors prove stable, they might reduce the volume of labeled video needed for training new action categories.
- Extensions to longer untrimmed videos would test whether the progressive denoising scales without accumulating alignment errors.
Load-bearing premise
The diffusion denoising process reliably extracts clean foreground knowledge that serves as an effective semantic anchor without introducing new artifacts or requiring task-specific tuning that could overfit the benchmarks.
What would settle it
A test set of videos with deliberately increased background complexity where the method's localization and classification accuracy falls below strong non-diffusion baselines would falsify the claim that the extracted foreground knowledge reliably improves alignment.
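Such a stress test could be scored with the standard temporal-IoU matching used by TAD benchmarks. A minimal sketch follows; the benchmarks' full mAP protocol (per-class AP averaged over multiple tIoU thresholds) is omitted, and all segment values are invented for illustration:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(preds, gts, thresh=0.5):
    """Fraction of ground-truth segments hit by at least one prediction
    at the given tIoU threshold; a crude stand-in for benchmark mAP."""
    if not gts:
        return 0.0
    hits = sum(any(temporal_iou(p, g) >= thresh for p in preds) for g in gts)
    return hits / len(gts)

# Hypothetical comparison: score the same detector on an easy split and
# on a background-cluttered split, then compare against a non-diffusion
# baseline scored the same way.
easy_score = recall_at_tiou([(2.0, 9.5)], [(2.0, 10.0)], 0.5)
```

The falsification criterion would then be a consistent drop of the diffusion method below strong baselines on the cluttered split under this kind of matching.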
Original abstract
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video contents, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge for the guidance of action-video alignment. Following the 'conditioning, denoising and aligning' manner, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module to inject extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code repository is provided as follows: https://anonymous.4open.science/r/Code-2114/.
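As a reading aid only, the conditioning-denoising-aligning flow described in the abstract can be caricatured in a few lines. Every function body here is an invented toy (a mean fusion, a linear pull toward the condition, cosine scoring) and is not the authors' implementation; it only mirrors the data flow SUC → BSD → FPA:

```python
import numpy as np

rng = np.random.default_rng(0)

def suc_condition(shared_emb, specific_emb):
    """Toy stand-in for SUC: fuse action-shared and action-specific
    text embeddings into one diffusion condition (here, a plain mean)."""
    return (shared_emb + specific_emb) / 2.0

def bsd_denoise(video_feat, condition, steps=10, alpha=0.2):
    """Toy stand-in for BSD: iteratively pull the video feature toward
    the condition, treating the suppressed residual as background."""
    x = video_feat.copy()
    for _ in range(steps):
        x = x + alpha * (condition - x)  # one "denoising" step
    return x  # foreground knowledge, the intermediate semantic anchor

def fpa_align(text_emb, foreground, video_feat):
    """Toy stand-in for FPA: prepend foreground knowledge as a prompt
    token to both sides, then score them with cosine similarity."""
    prompted_text = np.concatenate([foreground, text_emb])
    prompted_video = np.concatenate([foreground, video_feat])
    return float(prompted_text @ prompted_video /
                 (np.linalg.norm(prompted_text) * np.linalg.norm(prompted_video)))

d = 8
shared, specific = rng.normal(size=d), rng.normal(size=d)
video = rng.normal(size=d)
cond = suc_condition(shared, specific)
fg = bsd_denoise(video, cond)
score = fpa_align(specific, fg, video)
```

The point of the sketch is only that the anchor `fg` ends up closer to the unified semantics than the raw video feature, which is the property the paper's alignment argument rests on.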
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DFAlign, the first diffusion-based framework for open-vocabulary temporal action detection (OV-TAD). It follows a conditioning-denoising-aligning pipeline with three modules: Semantic-Unify Conditioning (SUC) to combine action-shared and action-specific semantics as diffusion conditions; Background-Suppress Denoising (BSD) to progressively remove background redundancy and produce foreground knowledge as an intermediate semantic anchor; and Foreground-Prompt Alignment (FPA) to inject the denoised foreground knowledge as prompt tokens into text representations for improved cross-modal alignment. The authors report that this approach achieves state-of-the-art performance on two OV-TAD benchmarks and release code.
Significance. If the central empirical claims hold, the work offers a novel architectural direction for handling semantic imbalance between concise action labels and rich video content in OV-TAD by leveraging diffusion denoising to create foreground anchors. The explicit code release is a strength that supports reproducibility and follow-up work in diffusion-based prompting for video understanding.
Major comments (2)
- [Abstract and §3] Abstract and §3 (Method): The central claim that BSD reliably extracts clean foreground knowledge serving as an effective semantic anchor rests on the assumption that the conditioned diffusion process (SUC + BSD) separates foreground without losing action cues or introducing artifacts. However, the manuscript provides no direct supporting evidence such as reconstruction metrics, feature visualizations, or ablation isolating the denoising step from FPA prompting, which is load-bearing for attributing SOTA gains to the diffusion mechanism rather than other components.
- [§4] §4 (Experiments): The abstract asserts SOTA results on two OV-TAD benchmarks, yet the provided manuscript text contains no quantitative tables, baseline comparisons, ablation studies, or error analysis to substantiate the magnitude of improvement or rule out post-hoc design choices, making independent verification of the central performance claim impossible from the given material.
Minor comments (2)
- [Abstract] Abstract: The code link is provided but remains anonymous; while acceptable for review, the final version should include a permanent repository link to enable full reproducibility.
- [§3.2] Notation: The description of the diffusion process in §3.2 would benefit from an explicit equation for the denoising objective or the conditioning mechanism to clarify how SUC and BSD interact.
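For reference, the equation the minor comment asks for would presumably specialize the standard conditional noise-prediction objective of DDPMs (Ho et al., 2020), with the SUC condition supplied as guidance; DFAlign's exact parameterization may differ:

```latex
\mathcal{L}_{\text{denoise}}
  = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})}
    \left[\,\bigl\|\boldsymbol{\epsilon}
      - \boldsymbol{\epsilon}_\theta\bigl(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0
        + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t,\; \mathbf{c}\bigr)\bigr\|^2\,\right]
```

where $\mathbf{x}_0$ would be the video feature, $\bar{\alpha}_t$ the cumulative noise schedule, and $\mathbf{c}$ the condition produced by SUC; writing BSD in this form would make the SUC-BSD interaction explicit.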
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide stronger empirical support.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that BSD reliably extracts clean foreground knowledge serving as an effective semantic anchor rests on the assumption that the conditioned diffusion process (SUC + BSD) separates foreground without losing action cues or introducing artifacts. However, the manuscript provides no direct supporting evidence such as reconstruction metrics, feature visualizations, or ablation isolating the denoising step from FPA prompting, which is load-bearing for attributing SOTA gains to the diffusion mechanism rather than other components.
Authors: We appreciate this observation. The manuscript includes ablation studies in §4 comparing performance with and without BSD to show its contribution. However, we agree that more direct evidence is needed. In revision we will add: feature visualizations of progressive background suppression in BSD, quantitative reconstruction metrics (e.g., feature similarity to action-relevant regions), and an ablation isolating BSD by comparing FPA applied to raw vs. denoised features. This will better demonstrate that the diffusion process yields clean foreground anchors without losing cues or adding artifacts. Revision: yes.
Referee: [§4] §4 (Experiments): The abstract asserts SOTA results on two OV-TAD benchmarks, yet the provided manuscript text contains no quantitative tables, baseline comparisons, ablation studies, or error analysis to substantiate the magnitude of improvement or rule out post-hoc design choices, making independent verification of the central performance claim impossible from the given material.
Authors: We apologize for any lack of clarity in the reviewed version. The manuscript contains experimental results in §4, but to ensure full verifiability we will expand the section with prominent tables showing baseline comparisons and mAP gains on both benchmarks, detailed ablations on SUC/BSD/FPA, and error analysis. We will also include confidence intervals and statistical tests. Combined with the released code, this will allow independent reproduction and verification of the reported improvements. Revision: yes.
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an architectural proposal (DFAlign with SUC, BSD, and FPA modules) that uses diffusion denoising to create foreground knowledge as a semantic anchor for alignment. No equations, derivations, or first-principles results in the provided text reduce any claimed output (e.g., foreground knowledge or SOTA performance) to fitted parameters, self-referential definitions, or self-citation chains by construction. The method is framed as an independent design choice validated by experiments on external benchmarks, with no load-bearing steps that collapse to renaming inputs or to ansatzes imported from the authors' prior work. This is the common case of a non-circular empirical method paper.