Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

Hubert P. H. Shum; Kanglei Zhou; Xiaohui Liang; Ziyi Chang

arxiv: 2606.13022 · v1 · pith:AV2QQ2UZnew · submitted 2026-06-11 · 💻 cs.CV · cs.LG

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

Pith reviewed 2026-06-27 07:16 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: adversarial attack, skeleton-based action recognition, motion quality, distribution-based optimization, empirical and true risks, imperceptible attack, human action recognition

The pith

A distribution-based attack closes the empirical-true risk gap to produce imperceptible adversarial motions on skeleton action recognition without noise perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper finds that prior adversarial attacks degrade post-attack motion quality because their optimization creates a gap between empirical and true risks. It therefore introduces a distribution-based method that generates adversarial examples directly from motion distributions rather than by adding noise-like perturbations. This keeps the resulting motions natural while raising attack success rates against current skeleton-based human action recognizers. The paper also defines a new evaluation metric that better matches human judgments of naturalness. If the approach holds, existing action recognizers are more vulnerable than previously measured and need stronger defenses.

Core claim

Degradation in motion quality after adversarial attacks on skeleton-based human action recognition stems from the gap between empirical and true risks in the optimization process. A distribution-based adversarial attack method minimizes this gap without introducing noise-like perturbations, thereby preserving motion quality while achieving higher attack success rates, as shown through experiments on state-of-the-art methods across two datasets and a newly proposed human-aligned quality metric.
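For reference, the two risks named in the core claim are usually defined as below. This is the textbook statistical-learning form, not the paper's attack-specific formulation, which this summary does not reproduce. The symbols $f$ (recognizer), $\ell$ (loss), $\mathcal{D}$ (true distribution over motion-label pairs) and $n$ (number of samples) are assumptions introduced here for illustration.

```latex
\begin{align}
R(f) &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x), y)\big] && \text{(true risk)} \\
\hat{R}_n(f) &= \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) && \text{(empirical risk)} \\
\Delta_n(f) &= \big|\, R(f) - \hat{R}_n(f) \,\big| && \text{(risk gap)}
\end{align}
```

How the paper instantiates these quantities for an attack objective is not stated in this summary.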

What carries the argument

Distribution-based adversarial attack that minimizes the empirical-true risk gap without noise-like perturbations.
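The contrast between a noise-like perturbation and an attack that searches inside a generator's latent space can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the paper's method: the "motion model" is a fixed low-frequency cosine basis (the figure captions mention diffusion latents, which are not modelled here), the victim is a random linear classifier, and `decode`, `score` and `roughness` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 64, 4  # frames in a toy 1-D joint trajectory, latent size

# Assumed stand-in for a learned motion model: a low-frequency cosine basis,
# so every decoded motion is smooth by construction.
t = np.arange(T)
basis = np.stack([np.cos(np.pi * k * (t + 0.5) / T) for k in range(1, K + 1)])

def decode(z):
    """Map a latent code z of shape (K,) to a motion of shape (T,)."""
    return z @ basis

# Assumed victim: a linear binary classifier; score > 0 means class 1.
w = rng.normal(size=T)

def score(x):
    return float(w @ x)

def roughness(x):
    """Mean squared second difference, a crude proxy for jitter."""
    return float(np.mean(np.diff(x, n=2) ** 2))

z0 = rng.normal(size=K)
x0 = decode(z0)
sign = 1.0 if score(x0) > 0 else -1.0  # direction that crosses the boundary

# (a) Noise-style attack: step the sample itself along the gradient sign.
x_noise = x0.copy()
while sign * score(x_noise) > 0:
    x_noise -= 0.05 * sign * np.sign(w)

# (b) Latent attack: step the latent code; the output stays in the
# decoder's range, so no frame-level noise is ever added.
z = z0.copy()
grad_z = basis @ w  # gradient of score(decode(z)) with respect to z
while sign * score(decode(z)) > 0:
    z -= 0.05 * sign * grad_z
x_latent = decode(z)

print("roughness clean :", roughness(x0))
print("roughness noise :", roughness(x_noise))
print("roughness latent:", roughness(x_latent))
```

Both loops stop once the classifier's decision flips. The comparison only shows that searching in a constrained latent space avoids frame-level jitter in this toy setting. It says nothing about whether the risk gap is the cause, which is the question the referee report raises.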

If this is right

Attack success rates exceed those of prior noise-based methods on state-of-the-art skeleton action recognizers.
Post-attack motions retain naturalness according to both the proposed metric and qualitative inspection.
The attacks remain imperceptible even to recent skeleton-based human action recognition (S-HAR) systems that detect noise-like perturbations.
The results indicate that current action recognizers lack robustness against carefully optimized distribution shifts.
Further defense research is needed to address optimization gaps rather than perturbation magnitude alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the risk-gap diagnosis is accurate, similar distribution-based formulations could be tested on video or RGB action datasets to check whether the same quality preservation occurs.
Defenses might be strengthened by explicitly regularizing the empirical-true risk gap during training rather than only penalizing perturbation size.
The new quality metric could serve as a general benchmark for any motion-editing task where human perception of naturalness matters.

Load-bearing premise

The gap between empirical and true risks is the primary driver of motion quality loss, and switching to distribution-based optimization removes that gap without creating new perceptible artifacts.

What would settle it

A controlled comparison in which the distribution-based attack still produces measurable drops in the new human-aligned quality metric or in which the empirical-true risk gap remains large after the method is applied.
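One way to make the "risk gap remains large" check concrete is to compare the loss on the samples used for optimization with the loss on a large held-out sample, using the held-out loss as a proxy for the true risk. The sketch below is an assumption-laden illustration on a toy regression problem, not the paper's protocol: `target`, `mean_squared_loss` and `risk_gap` are hypothetical names, and the true risk itself is never observable.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Toy ground-truth function (an assumption for illustration)."""
    return np.sin(3.0 * x)

def mean_squared_loss(coeffs, x, y):
    """Average squared error of a polynomial model on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def risk_gap(coeffs, fit_set, heldout_set):
    """Return (empirical risk, held-out proxy for true risk, their difference)."""
    empirical = mean_squared_loss(coeffs, *fit_set)
    proxy_true = mean_squared_loss(coeffs, *heldout_set)
    return empirical, proxy_true, proxy_true - empirical

# Six noisy points are used for fitting; 5000 fresh points stand in for the
# true distribution.
x_fit = np.linspace(-1.0, 1.0, 6)
y_fit = target(x_fit) + 0.1 * rng.normal(size=x_fit.size)
x_held = rng.uniform(-1.0, 1.0, size=5000)
y_held = target(x_held) + 0.1 * rng.normal(size=x_held.size)

# A degree-5 polynomial interpolates the six fit points, so the empirical
# risk is driven to (numerically) zero while the held-out risk is not.
coeffs = np.polyfit(x_fit, y_fit, deg=5)
empirical, proxy_true, gap = risk_gap(coeffs, (x_fit, y_fit), (x_held, y_held))
print(f"empirical={empirical:.4f} held-out={proxy_true:.4f} gap={gap:.4f}")
```

For an attack, the analogous measurement would replace the polynomial fit with the attack's optimization and the squared error with its loss. How the authors would define those quantities is not specified in this summary.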

Figures

Figures reproduced from arXiv: 2606.13022 by Hubert P. H. Shum, Kanglei Zhou, Xiaohui Liang, Ziyi Chang.

**Figure 2.** Overview of the proposed distribution-based imperceivable adversarial S-HAR attack where post-attack motion quality is preserved. We optimize the …

**Figure 3.** The visualization of diffusion latents at different timesteps. As shown, the earlier timesteps maintain more low-level details, and the later timesteps …

**Figure 4.** The illustration of the attack strategy. We illustrate an intermediate …

**Figure 5.** The mean power spectral density of adversarial samples found on 100STYLE (upper row) and HDM05 (lower row) against four classifiers.

**Figure 6.** The visualization of acceleration changes.

**Figure 7.** Visual comparison among the adversarial motions generated by different attack methods against victim models. We visualize the starting and the ending …

**Figure 8.** Perceptibility comparison across different methods.
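Figures 5 and 6 refer to mean power spectral density and acceleration as diagnostics. The sketch below shows one simple way such quantities could be computed for a skeleton sequence. The array layout `(frames, joints, 3)`, the frame rate, the periodogram normalization and the helper names `acceleration` and `mean_psd` are all assumptions, and the paper may compute these differently.

```python
import numpy as np

def acceleration(motion, fps=30.0):
    """Second finite difference along time; motion has shape (T, J, 3)."""
    return np.diff(motion, n=2, axis=0) * fps ** 2

def mean_psd(motion, fps=30.0):
    """Simple periodogram per coordinate, averaged over joints and axes."""
    n_frames = motion.shape[0]
    centered = motion - motion.mean(axis=0, keepdims=True)
    spectrum = np.fft.rfft(centered, axis=0)
    psd = np.abs(spectrum) ** 2 / (fps * n_frames)
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / fps)
    return freqs, psd.reshape(psd.shape[0], -1).mean(axis=1)

# Toy data: a smooth 1 Hz oscillation and a jittered copy of it.
rng = np.random.default_rng(0)
T, J, fps = 120, 5, 30.0
t = np.arange(T) / fps
clean = np.sin(2.0 * np.pi * t)[:, None, None] * np.ones((1, J, 3))
noisy = clean + 0.01 * rng.normal(size=clean.shape)

freqs, psd_clean = mean_psd(clean, fps)
_, psd_noisy = mean_psd(noisy, fps)

high = freqs > 5.0  # arbitrary cut-off used only for this illustration
hf_clean = float(psd_clean[high].sum())
hf_noisy = float(psd_noisy[high].sum())
print("high-frequency power, clean:", hf_clean)
print("high-frequency power, noisy:", hf_noisy)
print("mean |acceleration|, clean:", float(np.abs(acceleration(clean, fps)).mean()))
print("mean |acceleration|, noisy:", float(np.abs(acceleration(noisy, fps)).mean()))
```

Noise-like perturbations show up as extra power above the cut-off, which is the kind of signature a spectral comparison between attack methods would look for.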

Original abstract

Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Distribution-based attack on skeleton HAR improves quality scores but the claimed causal fix for the empirical-true risk gap lacks isolating evidence.

The paper's main offering is a distribution-based adversarial attack for skeleton action recognition that generates adversarial motions by sampling from a learned motion distribution rather than adding noise-like changes. The authors also introduce a new quality metric meant to track human perception of naturalness better than prior ones.

On the positive side, the experiments run on two standard datasets against current S-HAR models and report higher attack success rates alongside better post-attack quality numbers. The shift away from explicit noise is a reasonable direction if the goal is imperceptibility, and the new metric could be useful for future work if it holds up under more scrutiny.

The soft spot is the central causal claim. The abstract states that quality degradation in prior attacks stems from the gap between empirical and true risks during optimization, and that the distribution method closes this gap. Yet nothing in the provided summary shows an ablation that holds perturbation size or manifold distance fixed while varying only the risk-gap term. Without that, it is unclear whether the quality gain comes from the proposed mechanism or simply from staying inside a learned motion distribution by construction. This is where the claim is most exposed.

This work is aimed at researchers doing adversarial robustness for human action recognition. It is narrow in scope but addresses a concrete evaluation issue in that subfield.

I would send it for peer review. Reviewers should be asked to verify the optimization details, check for the missing ablations, and test whether the new metric actually predicts human judgments on held-out motions.

Referee Report

2 major / 2 minor

Summary. The paper claims that motion quality degradation in prior adversarial attacks on skeleton-based human action recognition (S-HAR) arises from the gap between empirical and true risks during optimization. It proposes a distribution-based attack method that generates adversarial motions without noise-like perturbations to close this gap and preserve quality, introduces a new human-perception-aligned motion quality metric, and reports superior attack success rates and post-attack quality on state-of-the-art S-HAR models across two datasets via qualitative and quantitative experiments.

Significance. If the risk-gap causal mechanism is isolated and the distribution-based method demonstrably closes it without other artifacts, the work would usefully expose robustness gaps in S-HAR systems and supply a quality-preserving attack baseline. The multi-dataset, multi-model experimental protocol and the attempt to introduce a perceptually aligned metric are positive elements that could support follow-on robustness research.

major comments (2)

[Abstract] The assertion that quality degradation 'stems from the gap between empirical and true risks' is load-bearing for the motivation and for the claim that the distribution-based method addresses the root cause. No ablation is described that holds perturbation magnitude or manifold distance fixed while varying only the risk-gap term, leaving the causal attribution untested against alternative explanations such as perturbation style.
[Abstract, §4 (method)] The distribution-based optimization is presented as directly minimizing the risk gap without introducing perceptible artifacts, yet the manuscript does not report an explicit measurement of the empirical-true risk gap on prior attacks or a controlled comparison showing that gap reduction (rather than on-manifold sampling by construction) is responsible for the observed quality gains.

minor comments (2)

[§3.2] The new motion quality metric is introduced without a dedicated validation section comparing it against existing perceptual or geometric metrics on a human study; this should be added or referenced.
[§4] Notation for the distribution-based attack (e.g., how the motion distribution is parameterized and optimized) is introduced without an explicit equation or pseudocode block, making reproduction difficult.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The concerns about isolating the causal role of the empirical-true risk gap are well-taken and point to opportunities to strengthen the manuscript. We respond to each major comment below and commit to revisions that directly address the identified gaps in evidence.

Referee: [Abstract] The assertion that quality degradation 'stems from the gap between empirical and true risks' is load-bearing for the motivation and for the claim that the distribution-based method addresses the root cause. No ablation is described that holds perturbation magnitude or manifold distance fixed while varying only the risk-gap term, leaving the causal attribution untested against alternative explanations such as perturbation style.

Authors: We agree that the current manuscript does not contain an ablation that isolates the risk-gap term while holding perturbation magnitude and manifold distance fixed. Such a controlled experiment would provide stronger support for the causal claim over alternative explanations. In the revised version we will add this ablation, comparing optimization objectives under matched constraints on perturbation size and manifold proximity.

revision: yes

Referee: [Abstract, §4 (method)] The distribution-based optimization is presented as directly minimizing the risk gap without introducing perceptible artifacts, yet the manuscript does not report an explicit measurement of the empirical-true risk gap on prior attacks or a controlled comparison showing that gap reduction (rather than on-manifold sampling by construction) is responsible for the observed quality gains.

Authors: The manuscript does not report numerical values of the empirical-true risk gap for baseline attacks, nor a direct comparison that attributes quality gains specifically to gap reduction versus the on-manifold sampling property. While the distribution-based formulation is motivated by gap minimization and the quality results are consistent with this view, we acknowledge the absence of the requested explicit measurements. We will compute and report the risk-gap values for prior methods and our approach, together with the controlled comparison, in the revision.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external empirical observation and new method, not self-referential reduction

The abstract asserts a causal link between quality degradation and the empirical-true risk gap, then introduces a distribution-based attack to close that gap. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would make the proposed method or its superiority equivalent to the inputs by construction. The new quality metric and experimental comparisons are presented as independent evaluations. This matches the default expectation of a non-circular paper whose central claims rest on external benchmarks rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the hypothesis that empirical-true risk gap drives quality loss and that distribution modeling closes it; these are introduced without upstream independent evidence in the provided abstract.

axioms (1)

domain assumption: Degradation in motion quality of prior attacks stems primarily from the gap between empirical and true risks in optimization.
Explicitly stated as the discovery motivating the work in the abstract.

invented entities (2)

distribution-based adversarial attack method (no independent evidence)
purpose: Generate adversarial motions without noise-like perturbations by minimizing risk gap
Proposed as the core solution in the abstract.

new motion quality metric (no independent evidence)
purpose: Faithfully evaluate post-attack motion naturalness in alignment with human perception
Introduced in the abstract to address evaluation limitations of prior methods.

pith-pipeline@v0.9.1-grok · 5726 in / 1393 out tokens · 37627 ms · 2026-06-27T07:16:18.776346+00:00 · methodology


2025
[66]

Learning discriminative representations for skeleton based action recognition,

H. Zhou, Q. Liu, and Y . Wang, “Learning discriminative representations for skeleton based action recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10 608–10 617

2023
[67]

Certified adversarial robustness via randomized smoothing,

J. Cohen, E. Rosenfeld, and Z. Kolter, “Certified adversarial robustness via randomized smoothing,” ininternational conference on machine learning. PMLR, 2019, pp. 1310–1320

2019
[68]

Generating diverse and natural 3d human motions from text,

C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5152–5161

2022
[69]

Physics-constrained attack against convolution-based human motion prediction,

C. Duan, Z. Zhang, X. Liu, Y . Dang, and J. Yin, “Physics-constrained attack against convolution-based human motion prediction,”Neurocom- puting, vol. 575, p. 127272, 2024

2024
[70]

Adversarial examples in the physical world,

A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” inArtificial intelligence safety and security. Chapman and Hall/CRC, 2018, pp. 99–112

2018
[71]

Boosting adversarial attacks with momentum,

Y . Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” inProceedings of the IEEE confer- ence on computer vision and pattern recognition, 2018, pp. 9185–9193

2018
[72]

Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,

W. Ma, Y . Li, X. Jia, and W. Xu, “Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4630–4639. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 15

2023
[73]

Human motion diffusion model,

G. Tevet, S. Raab, B. Gordon, Y . Shafir, D. Cohen-or, and A. H. Bermano, “Human motion diffusion model,” inThe Eleventh Interna- tional Conference on Learning Representations, 2023

2023
[74]

Understanding the latent space of diffusion models through the lens of riemannian geometry,

Y .-H. Park, M. Kwon, J. Choi, J. Jo, and Y . Uh, “Understanding the latent space of diffusion models through the lens of riemannian geometry,”Advances in Neural Information Processing Systems, vol. 36, pp. 24 129–24 142, 2023

2023
[75]

A phase transition in diffu- sion models reveals the hierarchical nature of data,

A. Sclocchi, A. Favero, and M. Wyart, “A phase transition in diffu- sion models reveals the hierarchical nature of data,”arXiv preprint arXiv:2402.16991, 2024

work page arXiv 2024
[76]

Dreamtime: An improved optimization strategy for diffusion-guided 3d generation,

Y . Huang, J. Wang, Y . Shi, B. Tang, X. Qi, and L. Zhang, “Dreamtime: An improved optimization strategy for diffusion-guided 3d generation,” inThe Twelfth International Conference on Learning Representations, 2023

2023
[77]

Diffusion models already have a semantic latent space,

M. Kwon, J. Jeong, and Y . Uh, “Diffusion models already have a semantic latent space,” inThe Eleventh International Conference on Learning Representations, 2023

2023
[78]

Magr: Manifold-aligned graph regularization for continual action quality assessment,

K. Zhou, L. Wang, X. Zhang, H. P. H. Shum, F. W. B. Li, J. Li, and X. Liang, “Magr: Manifold-aligned graph regularization for continual action quality assessment,” inProceedings of the 2024 European Con- ference on Computer Vision, ser. ECCV ’24. Springer, 2024

2024
[79]

A video-based augmented reality system for human-in-the-loop muscle strength assessment of juvenile dermatomyositis,

K. Zhou, R. Cai, Y . Ma, Q. Tan, X. Wang, J. Li, H. P. Shum, F. W. Li, S. Jin, and X. Liang, “A video-based augmented reality system for human-in-the-loop muscle strength assessment of juvenile dermatomyositis,”IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2456–2466, 2023

2023
[80]

Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,

K. Zhou, H. P. H. Shum, F. W. B. Li, X. Zhang, and X. Liang, “Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,”IEEE Transactions on Image Processing, vol. 34, pp. 3718–3732, 2025

2025
