
arxiv: 2606.13022 · v1 · pith:AV2QQ2UZnew · submitted 2026-06-11 · 💻 cs.CV · cs.LG

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

Pith reviewed 2026-06-27 07:16 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords adversarial attack · skeleton-based action recognition · motion quality · distribution-based optimization · empirical and true risks · imperceptible attack · human action recognition

The pith

A distribution-based attack minimizes the empirical-true risk gap to produce imperceptible adversarial motions against skeleton-based action recognizers without noise-like perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper finds that prior adversarial attacks degrade post-attack motion quality because their optimization creates a gap between empirical and true risks. It therefore introduces a distribution-based method that generates adversarial examples directly from motion distributions rather than by adding noise-like perturbations. This keeps the resulting motions natural while raising attack success rates against current skeleton-based human action recognizers. A new evaluation metric is also defined that better matches human judgments of naturalness. If the approach holds, existing action recognizers become more vulnerable than previously measured and require stronger defenses.

Core claim

Degradation in motion quality after adversarial attacks on skeleton-based human action recognition stems from the gap between empirical and true risks in the optimization process. A distribution-based adversarial attack method minimizes this gap without introducing noise-like perturbations, thereby preserving motion quality while achieving higher attack success rates, as shown through experiments on state-of-the-art methods across two datasets and a newly proposed human-aligned quality metric.
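For orientation, the empirical-true risk gap named in the core claim is usually written as below in statistical learning theory. This is the textbook definition, not an equation taken from the paper; the symbols $f$ (model), $\ell$ (loss), $\mathcal{D}$ (data distribution) and $n$ (sample count) are assumptions introduced here.

```latex
R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x),y)\big],
\qquad
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big),
\qquad
\mathrm{gap}_n(f) = \big|\,R(f)-\hat{R}_n(f)\,\big|
```

One plausible reading of the claim is that an attack optimized on a single observed motion minimizes an empirical quantity, while naturalness depends on the underlying motion distribution. How the paper instantiates these terms for the attack objective is not stated on this page.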

What carries the argument

Distribution-based adversarial attack that minimizes the empirical-true risk gap without noise-like perturbations.

If this is right

  • Attack success rates exceed those of prior noise-based methods on state-of-the-art skeleton action recognizers.
  • Post-attack motions retain naturalness according to both the proposed metric and qualitative inspection.
  • The attacks remain imperceptible even to recent skeleton-based human action recognition (S-HAR) systems that detect noise-like perturbations.
  • The results indicate that current action recognizers lack robustness against carefully optimized distribution shifts.
  • Further defense research is needed to address optimization gaps rather than perturbation magnitude alone.
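The attack success rate mentioned in the list above can be sketched as follows. This assumes one common convention (the fraction of correctly classified clean samples whose prediction changes to a wrong label after the attack); the paper's exact definition is not given on this page, and the function name `attack_success_rate` and the toy labels are hypothetical.

```python
import numpy as np

def attack_success_rate(y_true, pred_clean, pred_adv):
    """Fraction of correctly classified clean samples that the attack flips.

    Assumed convention: samples the victim already misclassifies are excluded.
    """
    y_true = np.asarray(y_true)
    pred_clean = np.asarray(pred_clean)
    pred_adv = np.asarray(pred_adv)
    attackable = pred_clean == y_true
    if not attackable.any():
        return float("nan")  # nothing to attack
    fooled = attackable & (pred_adv != y_true)
    return float(fooled.sum() / attackable.sum())

# Toy labels: the third sample is already misclassified before the attack.
y_true = [0, 1, 2, 1]
pred_clean = [0, 1, 0, 1]
pred_adv = [2, 1, 0, 0]
asr = attack_success_rate(y_true, pred_clean, pred_adv)
print(asr)
```

Excluding already-misclassified samples keeps the rate from being inflated by errors the victim model makes without any attack.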

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the risk-gap diagnosis is accurate, similar distribution-based formulations could be tested on video or RGB action datasets to check whether the same quality preservation occurs.
  • Defenses might be strengthened by explicitly regularizing the empirical-true risk gap during training rather than only penalizing perturbation size.
  • The new quality metric could serve as a general benchmark for any motion-editing task where human perception of naturalness matters.

Load-bearing premise

The gap between empirical and true risks is the primary driver of motion quality loss, and switching to distribution-based optimization removes that gap without creating new perceptible artifacts.

What would settle it

A controlled comparison in which the distribution-based attack still produces measurable drops in the new human-aligned quality metric or in which the empirical-true risk gap remains large after the method is applied.

Figures

Figures reproduced from arXiv: 2606.13022 by Hubert P. H. Shum, Kanglei Zhou, Xiaohui Liang, Ziyi Chang.

Figure 1: The schema of our proposed method when compared with existing …
Figure 2: Overview of the proposed distribution-based imperceivable adversarial S-HAR attack where post-attack motion quality is preserved. We optimize the …
Figure 3: The visualization of diffusion latents at different timesteps. As shown, the earlier timesteps maintain more low-level details, and the later timesteps …
Figure 4: The illustration of the attack strategy. We illustrate an intermediate …
Figure 5: The mean power spectral density of adversarial samples found on 100STYLE (upper row) and HDM05 (lower row) against four classifiers.
Figure 6: The visualization of acceleration changes.
Figure 7: Visual comparison among the adversarial motions generated by different attack methods against victim models. We visualize the starting and the ending …
Figure 8: Perceptibility comparison across different methods.
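The captions of Figures 5 and 6 refer to power spectral density and acceleration as diagnostics of noise-like perturbations. A minimal sketch of how such diagnostics could be computed on a skeleton sequence is given below. It assumes a `(frames, joints, 3)` array layout, a 30 fps frame rate, and a plain periodogram; none of these details are confirmed by this page, and the helper names are hypothetical.

```python
import numpy as np

def joint_acceleration(motion, fps=30.0):
    """Second finite difference along time; motion has shape (frames, joints, 3)."""
    dt = 1.0 / fps
    return np.diff(motion, n=2, axis=0) / dt**2

def mean_power_spectrum(motion, fps=30.0):
    """Periodogram per joint coordinate, averaged over joints and axes."""
    frames = motion.shape[0]
    centered = motion - motion.mean(axis=0, keepdims=True)
    spectrum = np.fft.rfft(centered, axis=0)
    power = np.abs(spectrum) ** 2 / (frames * fps)
    freqs = np.fft.rfftfreq(frames, d=1.0 / fps)
    return freqs, power.reshape(power.shape[0], -1).mean(axis=1)

rng = np.random.default_rng(0)
t = np.arange(120) / 30.0
# Toy smooth motion: the same 1 Hz sinusoid on 5 joints and 3 axes.
clean = np.sin(2 * np.pi * t)[:, None, None] * np.ones((1, 5, 3))
# Toy noise-like perturbation added on top of the smooth motion.
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

freqs, psd_clean = mean_power_spectrum(clean)
_, psd_noisy = mean_power_spectrum(noisy)
high = freqs > 5.0
acc_clean = np.abs(joint_acceleration(clean)).mean()
acc_noisy = np.abs(joint_acceleration(noisy)).mean()
print(psd_noisy[high].sum() > psd_clean[high].sum(), acc_noisy > acc_clean)
```

In this toy setting the added noise raises both the high-frequency power and the mean absolute acceleration, which is the kind of signature the two figures appear to examine.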
Original abstract

Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that motion quality degradation in prior adversarial attacks on skeleton-based human action recognition (S-HAR) arises from the gap between empirical and true risks during optimization. It proposes a distribution-based attack method that generates adversarial motions without noise-like perturbations to close this gap and preserve quality, introduces a new human-perception-aligned motion quality metric, and reports superior attack success rates and post-attack quality on state-of-the-art S-HAR models across two datasets via qualitative and quantitative experiments.

Significance. If the risk-gap causal mechanism is isolated and the distribution-based method demonstrably closes it without other artifacts, the work would usefully expose robustness gaps in S-HAR systems and supply a quality-preserving attack baseline. The multi-dataset, multi-model experimental protocol and the attempt to introduce a perceptually aligned metric are positive elements that could support follow-on robustness research.

major comments (2)
  1. [Abstract] Abstract: the assertion that quality degradation 'stems from the gap between empirical and true risks' is load-bearing for the motivation and for the claim that the distribution-based method addresses the root cause. No ablation is described that holds perturbation magnitude or manifold distance fixed while varying only the risk-gap term, leaving the causal attribution untested against alternative explanations such as perturbation style.
  2. [Abstract, §4] Abstract and §4 (method): the distribution-based optimization is presented as directly minimizing the risk gap without introducing perceptible artifacts, yet the manuscript does not report an explicit measurement of the empirical-true risk gap on prior attacks or a controlled comparison showing that gap reduction (rather than on-manifold sampling by construction) is responsible for the observed quality gains.
minor comments (2)
  1. [§3.2] The new motion quality metric is introduced without a dedicated validation section comparing it against existing perceptual or geometric metrics on a human study; this should be added or referenced.
  2. [§4] Notation for the distribution-based attack (e.g., how the motion distribution is parameterized and optimized) is introduced without an explicit equation or pseudocode block, making reproduction difficult.
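To make the distinction raised in the comments above concrete, the toy sketch below contrasts a noise-like input-space step with an attack that optimizes a latent variable and decodes it. It is not the paper's method: the linear decoder, the linear victim `w`, and all step sizes are hypothetical stand-ins chosen so the example runs with NumPy alone. It only illustrates the "on-manifold by construction" property that major comment 2 separates from risk-gap reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, latent_dim = 20, 3

# Hypothetical motion prior: a linear decoder whose outputs span a 3-D subspace.
basis, _ = np.linalg.qr(rng.standard_normal((dim, latent_dim)))

def decode(z):
    return basis @ z

def off_manifold_distance(x):
    """Distance from x to the decoder's subspace (0 means x is decodable)."""
    return float(np.linalg.norm(x - basis @ (basis.T @ x)))

# Hypothetical linear victim: predicts class 1 when w @ x > 0.
w = rng.standard_normal(dim)

z0 = rng.standard_normal(latent_dim)
x0 = decode(z0)
score0 = float(w @ x0)
direction = -np.sign(score0)  # push the score across the decision boundary

# (a) Noise-like attack: signed-gradient step in input space, just large enough to flip.
step = 1.1 * abs(score0) / np.abs(w).sum()
x_noise = x0 + direction * step * np.sign(w)

# (b) Latent attack: gradient step on the latent variable, then decode.
grad_z = basis.T @ w
alpha = 1.1 * abs(score0) / float(grad_z @ grad_z)
x_latent = decode(z0 + direction * alpha * grad_z)

flipped_noise = float(w @ x_noise) * score0 < 0
flipped_latent = float(w @ x_latent) * score0 < 0
print(flipped_noise, off_manifold_distance(x_noise))
print(flipped_latent, off_manifold_distance(x_latent))
```

Both variants flip the toy victim's decision, but only the input-space step leaves the decoder's subspace. A controlled comparison of the kind the referee requests would need to show that quality gains go beyond this built-in property.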

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The concerns about isolating the causal role of the empirical-true risk gap are well-taken and point to opportunities to strengthen the manuscript. We respond to each major comment below and commit to revisions that directly address the identified gaps in evidence.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that quality degradation 'stems from the gap between empirical and true risks' is load-bearing for the motivation and for the claim that the distribution-based method addresses the root cause. No ablation is described that holds perturbation magnitude or manifold distance fixed while varying only the risk-gap term, leaving the causal attribution untested against alternative explanations such as perturbation style.

    Authors: We agree that the current manuscript does not contain an ablation that isolates the risk-gap term while holding perturbation magnitude and manifold distance fixed. Such a controlled experiment would provide stronger support for the causal claim over alternative explanations. In the revised version we will add this ablation, comparing optimization objectives under matched constraints on perturbation size and manifold proximity. revision: yes

  2. Referee: [Abstract, §4] Abstract and §4 (method): the distribution-based optimization is presented as directly minimizing the risk gap without introducing perceptible artifacts, yet the manuscript does not report an explicit measurement of the empirical-true risk gap on prior attacks or a controlled comparison showing that gap reduction (rather than on-manifold sampling by construction) is responsible for the observed quality gains.

    Authors: The manuscript does not report numerical values of the empirical-true risk gap for baseline attacks, nor a direct comparison that attributes quality gains specifically to gap reduction versus the on-manifold sampling property. While the distribution-based formulation is motivated by gap minimization and the quality results are consistent with this view, we acknowledge the absence of the requested explicit measurements. We will compute and report the risk-gap values for prior methods and our approach, together with the controlled comparison, in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external empirical observation and new method, not self-referential reduction

Full rationale

The abstract asserts a causal link between quality degradation and the empirical-true risk gap, then introduces a distribution-based attack to close that gap. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would make the proposed method or its superiority equivalent to the inputs by construction. The new quality metric and experimental comparisons are presented as independent evaluations. This matches the default expectation of a non-circular paper whose central claims rest on external benchmarks rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the hypothesis that empirical-true risk gap drives quality loss and that distribution modeling closes it; these are introduced without upstream independent evidence in the provided abstract.

axioms (1)
  • domain assumption: Degradation in motion quality of prior attacks stems primarily from the gap between empirical and true risks in optimization.
    Explicitly stated as the discovery motivating the work in the abstract.
invented entities (2)
  • distribution-based adversarial attack method (no independent evidence)
    purpose: Generate adversarial motions without noise-like perturbations by minimizing the risk gap.
    Proposed as the core solution in the abstract.
  • new motion quality metric (no independent evidence)
    purpose: Faithfully evaluate post-attack motion naturalness in alignment with human perception.
    Introduced in the abstract to address evaluation limitations of prior methods.

pith-pipeline@v0.9.1-grok · 5726 in / 1393 out tokens · 37627 ms · 2026-06-27T07:16:18.776346+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    A survey on 3d skeleton-based action recognition using learning method,

    B. Ren, M. Liu, R. Ding, and H. Liu, “A survey on 3d skeleton-based action recognition using learning method,”Cyborg and Bionic Systems, vol. 5, p. 0100, 2024

  2. [2]

    Tasar: Transferable attack on skeletal action recognition,

    Y . Diao, B. Wu, R. Zhang, A. Liu, X. Wei, M. Wang, and H. Wang, “Tasar: Transferable attack on skeletal action recognition,”arXiv preprint arXiv:2409.02483, 2024

  3. [3]

    Recent advances in adversarial training for adversarial robustness,

    T. Bai, J. Luo, J. Zhao, B. Wen, and Q. Wang, “Recent advances in adversarial training for adversarial robustness,” inProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI- 21, Z.-H. Zhou, Ed. International Joint Conferences on Artificial Intelligence Organization, 8 2021, pp. 4312–4321, survey Track

  4. [4]

    Toward trustworthy artificial intelligence (tai) in the context of explainability and robustness,

    B. Chander, C. John, L. Warrier, and K. Gopalakrishnan, “Toward trustworthy artificial intelligence (tai) in the context of explainability and robustness,”ACM Computing Surveys, 2024

  5. [5]

    Understanding the robustness of skeleton-based action recognition under adversarial attack,

    H. Wang, F. He, Z. Peng, T. Shao, Y .-L. Yang, K. Zhou, and D. Hogg, “Understanding the robustness of skeleton-based action recognition under adversarial attack,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 656–14 665

  6. [6]

    Adversarial attack on skeleton-based human action recognition,

    J. Liu, N. Akhtar, and A. Mian, “Adversarial attack on skeleton-based human action recognition,”IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 4, pp. 1609–1622, 2020

  7. [7]

    Decomposing biological motion: A framework for analysis and synthesis of human gait patterns,

    N. F. Troje, “Decomposing biological motion: A framework for analysis and synthesis of human gait patterns,”Journal of vision, vol. 2, no. 5, pp. 2–2, 2002

  8. [8]

    Modulation of motor area activity during observation of unnatural body movements,

    S. Shimada and K. Oki, “Modulation of motor area activity during observation of unnatural body movements,”Brain and cognition, vol. 80, no. 1, pp. 1–6, 2012

  9. [9]

    Ntu rgb+ d: A large scale dataset for 3d human activity analysis,

    A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019

  10. [10]

    The Kinetics Human Action Video Dataset

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsevet al., “The kinetics human action video dataset,”arXiv preprint arXiv:1705.06950, 2017

  11. [11]

    Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,

    J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot, “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,”IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2684–2701, 2019

  12. [12]

    Towards deep learning models resistant to adversarial attacks,

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inInternational Conference on Learning Representations, 2018

  13. [13]

    Adversarial examples are not bugs, they are features,

    A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,”Advances in neural information processing systems, vol. 32, 2019

  14. [14]

    Adversarial examples are not real features,

    A. Li, Y . Wang, Y . Guo, and Y . Wang, “Adversarial examples are not real features,”Advances in Neural Information Processing Systems, vol. 36, 2024

  15. [15]

    Optimizing diffusion noise can serve as universal motion priors,

    K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang, “Optimizing diffusion noise can serve as universal motion priors,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1334–1345

  16. [16]

    Lidpose: Real-time 3d human pose estimation in sparse lidar point clouds with non-repetitive circular scanning pattern,

    L. Kov ´acs, B. M. B´odis, and C. Benedek, “Lidpose: Real-time 3d human pose estimation in sparse lidar point clouds with non-repetitive circular scanning pattern,”Sensors, vol. 24, no. 11, p. 3427, 2024

  17. [17]

    Finepose: Fine-grained prompt-driven 3d human pose estimation via diffusion models,

    J. Xu, Y . Guo, and Y . Peng, “Finepose: Fine-grained prompt-driven 3d human pose estimation via diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 561–570

  18. [18]

    Wham: Reconstructing world-grounded humans with accurate 3d motion,

    S. Shin, J. Kim, E. Halilaj, and M. J. Black, “Wham: Reconstructing world-grounded humans with accurate 3d motion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2070–2080

  19. [19]

    Principles of risk minimization for learning theory,

    V . Vapnik, “Principles of risk minimization for learning theory,”Ad- vances in neural information processing systems, vol. 4, 1991

  20. [20]

    Devroye, L

    L. Devroye, L. Gy ¨orfi, and G. Lugosi,A probabilistic theory of pattern recognition. Springer Science & Business Media, 2013, vol. 31

  21. [21]

    Model selection and model averaging,

    G. Claeskens and N. L. Hjort, “Model selection and model averaging,” Cambridge books, 2008

  22. [22]

    Costs of position, velocity, and force requirements in optimal control induce triphasic muscle activation during reaching movement,

    Y . Ueyama, “Costs of position, velocity, and force requirements in optimal control induce triphasic muscle activation during reaching movement,”Scientific Reports, vol. 11, no. 1, p. 16815, 2021

  23. [23]

    Five basic muscle ac- tivation patterns account for muscle activity during human locomotion,

    Y . P. Ivanenko, R. E. Poppele, and F. Lacquaniti, “Five basic muscle ac- tivation patterns account for muscle activity during human locomotion,” The Journal of physiology, vol. 556, no. 1, pp. 267–282, 2004

  24. [24]

    Real-time style modelling of human locomotion via feature-wise transformations and local motion phases,

    I. Mason, S. Starke, and T. Komura, “Real-time style modelling of human locomotion via feature-wise transformations and local motion phases,”Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 5, no. 1, may 2022

  25. [25]

    Documentation mocap database hdm05,

    M. M ¨uller, T. R¨oder, M. Clausen, B. Eberhardt, B. Kr¨uger, and A. Weber, “Documentation mocap database hdm05,” Universit¨at Bonn, Tech. Rep. CG-2007-2, June 2007

  26. [26]

    Intriguing properties of neural networks

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,”arXiv preprint arXiv:1312.6199, 2013

  27. [27]

    Better aggregation in test-time augmentation,

    D. Shanmugam, D. Blalock, G. Balakrishnan, and J. Guttag, “Better aggregation in test-time augmentation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 1214–1223

  28. [28]

    Retouchuaa: Unconstrained adversarial attack via realistic image retouching,

    M. Xie, Y . He, Z. Qin, and M. Fang, “Retouchuaa: Unconstrained adversarial attack via realistic image retouching,”IEEE Transactions IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 14 on Circuits and Systems for Video Technology, vol. 35, no. 3, pp. 2586– 2602, 2025

  29. [29]

    Lesep: Boosting adversarial transferability via latent encoding and semantic embedding perturbations,

    Y . Gan, C. Wu, D. Ouyang, S. Tang, M. Ye, and T. Xiang, “Lesep: Boosting adversarial transferability via latent encoding and semantic embedding perturbations,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 4, pp. 3368–3382, 2025

  30. [30]

    Videopure: Diffusion-based adversarial purification for video recognition,

    K. Jiang, Z. Chen, J. Fu, L. Hong, J. Li, and W. Zhang, “Videopure: Diffusion-based adversarial purification for video recognition,”IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2025

  31. [31]

    Transhfc: Joints hypergraph filtering convolution and transformer framework for temporal forgery localization,

    J. Huang, X. Yuan, C.-T. Lam, S.-K. Im, F. Lei, and X. Bi, “Transhfc: Joints hypergraph filtering convolution and transformer framework for temporal forgery localization,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

  32. [32]

    A unified framework for adversarial patch attacks against visual 3d object detection in autonomous driving,

    J. Wang, F. Li, and L. He, “A unified framework for adversarial patch attacks against visual 3d object detection in autonomous driving,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

  33. [33]

    Synthesizing robust adversarial examples,

    A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing robust adversarial examples,” inInternational conference on machine learning. PMLR, 2018, pp. 284–293

  34. [34]

    Deeprobust: A pytorch library for adversarial attacks and defenses,

    Y . Li, W. Jin, H. Xu, and J. Tang, “Deeprobust: A pytorch library for adversarial attacks and defenses,”arXiv preprint arXiv:2005.06149, 2020

  35. [35]

    Adversarial attacks on time series,

    F. Karim, S. Majumdar, and H. Darabi, “Adversarial attacks on time series,”IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3309–3320, 2020

  36. [36]

    Hard no- box adversarial attack on skeleton-based human action recognition with skeleton-motion-informed gradient,

    Z. Lu, H. Wang, Z. Chang, G. Yang, and H. P. Shum, “Hard no- box adversarial attack on skeleton-based human action recognition with skeleton-motion-informed gradient,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4597–4606

  37. [37]

    Understanding the vulnerability of skeleton-based human activity recognition via black-box attack,

    Y . Diao, H. Wang, T. Shao, Y . Yang, K. Zhou, D. Hogg, and M. Wang, “Understanding the vulnerability of skeleton-based human activity recognition via black-box attack,”Pattern Recognition, vol. 153, p. 110564, 2024

  38. [38]

    Adversarial bone length attack on action recognition,

    N. Tanaka, H. Kera, and K. Kawamoto, “Adversarial bone length attack on action recognition,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2335–2343

  39. [39]

    Bones of contention: Exploring query-efficient attacks against skeleton recognition systems,

    Y . Cao, K. Ye, D. Wang, M. Xue, H. Ge, C. Qian, and J. S. Dong, “Bones of contention: Exploring query-efficient attacks against skeleton recognition systems,”IEEE Transactions on Information Forensics and Security, vol. 21, pp. 183–196, 2025

  40. [40]

    Qesar: Query effective decision-based attack on skeletal action recognition,

    Z. Kang, Y . Zhang, R. Zhang, Y . Jiang, and H. Xia, “Qesar: Query effective decision-based attack on skeletal action recognition,” inChi- nese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2023, pp. 417–429

  41. [41]

    Deep unsupervised learning using nonequilibrium thermodynamics,

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” inInternational Conference on Machine Learning. PMLR, 2015, pp. 2256–2265

  42. [42]

    On the design fundamentals of diffusion models: A survey,

    Z. Chang, G. A. Koulieris, and H. P. Shum, “On the design fundamentals of diffusion models: A survey,”arXiv preprint arXiv:2306.04542, 2023

  43. [43]

    Boosting black-box attack to deep neural networks with conditional diffusion models,

    R. Liu, W. Zhou, T. Zhang, K. Chen, J. Zhao, and K.-Y . Lam, “Boosting black-box attack to deep neural networks with conditional diffusion models,”IEEE Transactions on Information Forensics and Security, 2024

  44. [44]

    Advdiffuser: Natural adversarial example synthesis with diffusion models,

    X. Chen, X. Gao, J. Zhao, K. Ye, and C.-Z. Xu, “Advdiffuser: Natural adversarial example synthesis with diffusion models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4562–4572

  45. [45]

    Un- stoppable attack: Label-only model inversion via conditional diffusion model,

    R. Liu, D. Wang, Y . Ren, Z. Wang, K. Guo, Q. Qin, and X. Liu, “Un- stoppable attack: Label-only model inversion via conditional diffusion model,”IEEE Transactions on Information Forensics and Security, 2024

  46. [46]

    Towards transferable attack via adversarial diffusion in face recognition,

    C. Hu, Y . Li, Z. Feng, and X. Wu, “Towards transferable attack via adversarial diffusion in face recognition,”IEEE Transactions on Information Forensics and Security, 2024

  47. [47]

    Establishing robust generative image steganography via popular stable diffusion,

    X. Hu, S. Li, Q. Ying, W. Peng, X. Zhang, and Z. Qian, “Establishing robust generative image steganography via popular stable diffusion,” IEEE Transactions on Information Forensics and Security, 2024

  48. [48]

    Diffilter: Defending against adversarial perturbations with diffusion filter,

    Y . Chen, X. Li, X. Wang, P. Hu, and D. Peng, “Diffilter: Defending against adversarial perturbations with diffusion filter,”IEEE Transac- tions on Information Forensics and Security, 2024

  49. [49]

    Diffusion patch attack with spatial–temporal cross-evolution for video recognition,

    J. Yang, Z. Guan, J. Li, Z. Shi, and X. Liu, “Diffusion patch attack with spatial–temporal cross-evolution for video recognition,”IEEE Transac- tions on Circuits and Systems for Video Technology, vol. 34, no. 12, pp. 13 190–13 200, 2024

  50. [50]

    Videopure: Diffusion-based adversarial purification for video recognition,

    K. Jiang, Z. Chen, J. Fu, L. Hong, J. Li, and W. Zhang, “Videopure: Diffusion-based adversarial purification for video recognition,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

  51. [51]

    Human motion generation: A survey,

    W. Zhu, X. Ma, D. Ro, H. Ci, J. Zhang, J. Shi, F. Gao, Q. Tian, and Y . Wang, “Human motion generation: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  52. [52]

    From the perception of action to the understanding of intention,

    S.-J. Blakemore and J. Decety, “From the perception of action to the understanding of intention,”Nature reviews neuroscience, vol. 2, no. 8, pp. 561–567, 2001

  53. [53]

    Brain areas involved in perception of biological motion,

    E. Grossman, M. Donnelly, R. Price, D. Pickens, V . Morgan, G. Neigh- bor, and R. Blake, “Brain areas involved in perception of biological motion,”Journal of cognitive neuroscience, vol. 12, no. 5, pp. 711–720, 2000

  54. [54]

    A latent space of stochastic diffusion models for zero-shot image editing and guidance,

    C. H. Wu and F. De la Torre, “A latent space of stochastic diffusion models for zero-shot image editing and guidance,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7378–7387

  55. [55]

    Intriguing properties of generative classifiers,

    P. Jaini, K. Clark, and R. Geirhos, “Intriguing properties of generative classifiers,” inThe Twelfth International Conference on Learning Rep- resentations, 2024

  56. [56]

    Action-conditioned 3d human motion synthesis with transformer vae,

    M. Petrovich, M. J. Black, and G. Varol, “Action-conditioned 3d human motion synthesis with transformer vae,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 985–10 995

  57. [57]

    Diffusion autoencoders: Toward a meaningful and decodable represen- tation,

    K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn, “Diffusion autoencoders: Toward a meaningful and decodable represen- tation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 619–10 629

  58. [58]

    Noise2score: tweedie’s approach to self- supervised image denoising without clean images,

    K. Kim and J. C. Ye, “Noise2score: tweedie’s approach to self- supervised image denoising without clean images,”Advances in Neural Information Processing Systems, vol. 34, pp. 864–874, 2021

  59. [59]

    Denoising diffusion implicit models,

    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inInternational Conference on Learning Representations, 2021

  60. [60]

    Muscles in action,

    M. Chiquier and C. V ondrick, “Muscles in action,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 091–22 101

  61. [61]

    Muscles in time: Learning to understand human motion in-depth by simulating muscle activations,

    D. Schneider, S. Reiß, M. Kugler, A. Jaus, K. Peng, S. Sutschet, M. S. Sarfraz, S. Matthiesen, and R. Stiefelhagen, “Muscles in time: Learning to understand human motion in-depth by simulating muscle activations,” Advances in Neural Information Processing Systems, 2025

  62. [62]

    Basar: Black-box attack on skeletal action recognition,

    Y . Diao, T. Shao, Y .-L. Yang, K. Zhou, and H. Wang, “Basar: Black-box attack on skeletal action recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, Confer- ence Proceedings, pp. 7597–7607

  63. [63]

    Smoodi: Stylized motion diffusion model,

    L. Zhong, Y. Xie, V. Jampani, D. Sun, and H. Jiang, “Smoodi: Stylized motion diffusion model,” in European Conference on Computer Vision. Springer, 2025, pp. 405–421

  64. [64]

    Spatio-temporal tuples transformer for skeleton-based action recognition,

    H. Qiu, B. Hou, B. Ren, and X. Zhang, “Spatio-temporal tuples transformer for skeleton-based action recognition,” arXiv preprint arXiv:2201.02849, 2022

  65. [65]

    Skateformer: Skeletal-temporal transformer for human action recognition,

    J. Do and M. Kim, “Skateformer: Skeletal-temporal transformer for human action recognition,” in European Conference on Computer Vision. Springer, 2025

  66. [66]

    Learning discriminative representations for skeleton based action recognition,

    H. Zhou, Q. Liu, and Y. Wang, “Learning discriminative representations for skeleton based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10608–10617

  67. [67]

    Certified adversarial robustness via randomized smoothing,

    J. Cohen, E. Rosenfeld, and Z. Kolter, “Certified adversarial robustness via randomized smoothing,” in International Conference on Machine Learning. PMLR, 2019, pp. 1310–1320

  68. [68]

    Generating diverse and natural 3d human motions from text,

    C. Guo, S. Zou, X. Zuo, S. Wang, W. Ji, X. Li, and L. Cheng, “Generating diverse and natural 3d human motions from text,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5152–5161

  69. [69]

    Physics-constrained attack against convolution-based human motion prediction,

    C. Duan, Z. Zhang, X. Liu, Y. Dang, and J. Yin, “Physics-constrained attack against convolution-based human motion prediction,” Neurocomputing, vol. 575, p. 127272, 2024

  70. [70]

    Adversarial examples in the physical world,

    A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Artificial Intelligence Safety and Security. Chapman and Hall/CRC, 2018, pp. 99–112

  71. [71]

    Boosting adversarial attacks with momentum,

    Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9185–9193

  72. [72]

    Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,

    W. Ma, Y. Li, X. Jia, and W. Xu, “Transferable adversarial attack for both vision transformers and convolutional networks via momentum integrated gradients,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4630–4639

  73. [73]

    Human motion diffusion model,

    G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, “Human motion diffusion model,” in The Eleventh International Conference on Learning Representations, 2023

  74. [74]

    Understanding the latent space of diffusion models through the lens of riemannian geometry,

    Y.-H. Park, M. Kwon, J. Choi, J. Jo, and Y. Uh, “Understanding the latent space of diffusion models through the lens of riemannian geometry,” Advances in Neural Information Processing Systems, vol. 36, pp. 24129–24142, 2023

  75. [75]

    A phase transition in diffusion models reveals the hierarchical nature of data,

    A. Sclocchi, A. Favero, and M. Wyart, “A phase transition in diffusion models reveals the hierarchical nature of data,” arXiv preprint arXiv:2402.16991, 2024

  76. [76]

    Dreamtime: An improved optimization strategy for diffusion-guided 3d generation,

    Y. Huang, J. Wang, Y. Shi, B. Tang, X. Qi, and L. Zhang, “Dreamtime: An improved optimization strategy for diffusion-guided 3d generation,” in The Twelfth International Conference on Learning Representations, 2023

  77. [77]

    Diffusion models already have a semantic latent space,

    M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in The Eleventh International Conference on Learning Representations, 2023

  78. [78]

    Magr: Manifold-aligned graph regularization for continual action quality assessment,

    K. Zhou, L. Wang, X. Zhang, H. P. H. Shum, F. W. B. Li, J. Li, and X. Liang, “Magr: Manifold-aligned graph regularization for continual action quality assessment,” in Proceedings of the 2024 European Conference on Computer Vision, ser. ECCV ’24. Springer, 2024

  79. [79]

    A video-based augmented reality system for human-in-the-loop muscle strength assessment of juvenile dermatomyositis,

    K. Zhou, R. Cai, Y. Ma, Q. Tan, X. Wang, J. Li, H. P. Shum, F. W. Li, S. Jin, and X. Liang, “A video-based augmented reality system for human-in-the-loop muscle strength assessment of juvenile dermatomyositis,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2456–2466, 2023

  80. [80]

    Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,

    K. Zhou, H. P. H. Shum, F. W. B. Li, X. Zhang, and X. Liang, “Phi: Bridging domain shift in long-term action quality assessment via progressive hierarchical instruction,” IEEE Transactions on Image Processing, vol. 34, pp. 3718–3732, 2025
