Pith · machine review for the scientific record

arxiv: 2604.09063 · v2 · submitted 2026-04-10 · 💻 cs.CV · cs.AI


Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

Jingyu Pan, Yuxi Zhou, Zhengbo Zhang, Zhigang Tu, Zhiyu Lin


Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords zero-shot skeleton action recognition · diffusion models · spectral bias · frequency enhancement · semantic alignment · curriculum learning · skeleton-text matching

The pith

Frequency-aware diffusion models recover fine-grained motion details for zero-shot skeleton action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the spectral bias in diffusion models that causes oversmoothing of high-frequency motion dynamics when matching skeleton sequences to text descriptions in the absence of action labels. It introduces three modules that guide the generative process to preserve detailed temporal patterns while building semantic alignment. A sympathetic reader would care because zero-shot recognition could then handle novel actions in applications like surveillance or human-robot interaction without exhaustive new annotations. The method is evaluated on standard skeleton benchmarks and reports improved results over prior approaches.

Core claim

By integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction into a diffusion framework called FDSM, the approach counters spectral bias to recover fine-grained motion details. This enables better skeleton-text matching in the zero-shot setting, producing state-of-the-art recognition accuracy on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets.

What carries the argument

Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), which uses semantic guidance to correct high-frequency loss during the diffusion process.
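The abstract does not specify how the Timestep-Adaptive Spectral Loss is computed. As a hedged illustration of the general idea only, a temporal-frequency penalty with a hypothetical linear timestep weighting might look like the sketch below; the rFFT-based band split, the `cutoff` fraction, and the `1 - t/T` schedule are all assumptions for illustration, not the paper's definitions:

```python
import numpy as np

def spectral_loss(pred, target, t, T=1000, cutoff=0.5):
    """Hypothetical timestep-adaptive spectral loss (illustrative only).

    pred, target: (frames, joints, 3) skeleton sequences.
    t: current diffusion timestep; the linear weighting below is an
    assumption, not the schedule proposed in the paper.
    """
    # A real FFT along the temporal axis separates slow (low-frequency)
    # from fast (high-frequency) motion components.
    P = np.fft.rfft(pred, axis=0)
    Q = np.fft.rfft(target, axis=0)
    k = int(cutoff * P.shape[0])                  # start of the high band
    lf_err = np.mean(np.abs(P[:k] - Q[:k]) ** 2)  # low-frequency mismatch
    hf_err = np.mean(np.abs(P[k:] - Q[k:]) ** 2)  # high-frequency mismatch
    w = 1.0 - t / T  # weight the high band more at low-noise timesteps
    return lf_err + w * hf_err
```

Weighting the high band more heavily at low-noise timesteps mirrors the intuition that fine motion detail matters most near the end of the reverse diffusion process, which is where spectral bias would otherwise oversmooth it.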

If this is right

  • The modules restore motion details that standard diffusion oversmooths during semantic alignment.
  • Curriculum abstraction supports progressive learning of text-skeleton correspondences without labels.
  • The combined losses allow diffusion models to generalize to unseen actions on multiple benchmarks.
  • State-of-the-art results follow on NTU RGB+D, PKU-MMD, and Kinetics-skeleton.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar frequency corrections could apply to other sequence-to-text tasks where diffusion models lose temporal sharpness.
  • The approach suggests a route to reduce reliance on labeled data across multimodal action understanding problems.
  • Testing the modules on non-skeleton inputs such as RGB video could reveal whether the bias correction is modality-specific.

Load-bearing premise

That the spectral bias of diffusion models is the main bottleneck in zero-shot skeleton action recognition and that the three modules correct it without new errors or dataset-specific tuning.

What would settle it

A direct comparison showing either no measurable improvement in high-frequency skeleton components or a failure to exceed prior zero-shot methods on the NTU RGB+D dataset would refute the claim.
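One concrete way to run that comparison would be to measure what fraction of a sequence's temporal-spectrum energy sits above a cutoff frequency, for generated versus ground-truth skeletons. The function below is an illustrative sketch, not the paper's protocol; the `cutoff` fraction and the rFFT-based energy split are assumptions:

```python
import numpy as np

def hf_energy_ratio(seq, cutoff=0.5):
    """Fraction of temporal-spectrum energy above a cutoff (illustrative).

    seq: (frames, joints, 3) skeleton sequence. A generated sequence
    whose ratio is much lower than the ground truth's would indicate
    oversmoothing of high-frequency motion.
    """
    spec = np.abs(np.fft.rfft(seq, axis=0)) ** 2  # per-bin temporal energy
    k = int(cutoff * spec.shape[0])               # start of the high band
    return spec[k:].sum() / spec.sum()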

read the original abstract

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript proposes Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM) for zero-shot skeleton action recognition. It integrates three modules—a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction—to counteract the spectral bias of diffusion models that oversmooths high-frequency motion details. The central claim is that the resulting method recovers fine-grained dynamics and reaches state-of-the-art performance on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton benchmarks, with code released at a public repository.

Significance. If the performance claims are substantiated by quantitative results, ablations, and error analysis, the work would offer a targeted improvement to diffusion-based zero-shot skeleton recognition by explicitly recovering high-frequency components. The public code release would further support reproducibility and extension by the community.

minor comments (1)
  1. Abstract: the claim of state-of-the-art performance is stated without any numerical metrics, baseline comparisons, or ablation results, making immediate assessment of the central empirical claim impossible from the provided text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review of our manuscript on Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM). We note the positive assessment of potential significance if the performance claims are substantiated, and the uncertain recommendation. The manuscript provides quantitative results, ablations, and supporting analysis on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton benchmarks, with public code release for reproducibility. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and available context describe a methodological proposal (FDSM with three modules) addressing an external known issue (spectral bias of diffusion models in ZSAR) via empirical integration and SOTA claims on standard datasets (NTU RGB+D, PKU-MMD, Kinetics-skeleton). No equations, derivations, predictions, or self-citations appear that reduce any result to its own inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations are present. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, mathematical axioms, or postulated physical entities. The three modules are algorithmic contributions whose internal hyperparameters or training details are not described here.

pith-pipeline@v0.9.0 · 5474 in / 1335 out tokens · 45714 ms · 2026-05-10T16:51:09.040451+00:00 · methodology

discussion (0)

