Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3
The pith
Frequency-aware diffusion models recover fine-grained motion details for zero-shot skeleton action recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction into a diffusion framework called FDSM, the approach counters spectral bias to recover fine-grained motion details. This enables better skeleton-text matching in the zero-shot setting, producing state-of-the-art recognition accuracy on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets.
What carries the argument
Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM) that uses semantic guidance to correct high-frequency loss during the diffusion process.
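The Timestep-Adaptive Spectral Loss named above is not specified in the available text; a minimal sketch of the general idea is below, assuming the loss compares sequences in the temporal frequency domain and up-weights high-frequency error as denoising nears completion. All names, shapes, and the weighting schedule are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def spectral_loss(pred, target, t, t_max, hf_cut=0.25):
    """Timestep-adaptive spectral loss (illustrative sketch, not FDSM's).

    Compares predicted and target skeleton sequences of shape
    (frames, joints, 3) in the temporal frequency domain, up-weighting
    high-frequency error as the diffusion timestep t approaches 0
    (the nearly-denoised end of the chain). `hf_cut` is the fraction
    of the spectrum treated as high frequency -- an assumed parameter.
    """
    # rFFT along the time axis: (T, J, 3) -> (T//2 + 1, J, 3)
    P = np.fft.rfft(pred, axis=0)
    Q = np.fft.rfft(target, axis=0)
    err = np.abs(P - Q) ** 2

    # Split spectral error into low- and high-frequency bands.
    n_bins = err.shape[0]
    cut = int(n_bins * (1.0 - hf_cut))
    low, high = err[:cut].mean(), err[cut:].mean()

    # Late timesteps (small t) emphasise fine motion detail.
    w_high = 1.0 + (1.0 - t / t_max)
    return low + w_high * high
```

Under this schedule the same reconstruction error costs more near the end of the reverse process, which is one plausible way to counteract the tendency of denoisers to fit low frequencies first.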
If this is right
- The modules restore motion details that standard diffusion oversmooths during semantic alignment.
- Curriculum abstraction supports progressive learning of text-skeleton correspondences without labels.
- The combined losses allow diffusion models to generalize to unseen actions on multiple benchmarks.
- State-of-the-art results follow on NTU RGB+D, PKU-MMD, and Kinetics-skeleton.
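The curriculum abstraction mentioned above is likewise unspecified in the available text; a minimal sketch of one plausible reading follows, assuming training starts with coarse action descriptions and progressively unlocks finer-grained ones. The function name, the linear unlock schedule, and the example prompts are all hypothetical.

```python
def curriculum_prompts(epoch, n_epochs, levels):
    """Curriculum-based semantic abstraction (illustrative sketch).

    `levels` orders text descriptions from most abstract to most
    specific, e.g. ["a person moves", "a person waves",
    "a person waves the right hand above the head"]. Early epochs
    see only coarse descriptions; finer levels unlock as training
    progresses. The linear schedule is an assumption, not the paper's.
    """
    frac = (epoch + 1) / n_epochs
    unlocked = max(1, round(frac * len(levels)))
    return levels[:unlocked]
```

For example, with three abstraction levels and ten epochs, only the coarsest description is available at epoch 0, and all three by the final epoch.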
Where Pith is reading between the lines
- Similar frequency corrections could apply to other sequence-to-text tasks where diffusion models lose temporal sharpness.
- The approach suggests a route to reduce reliance on labeled data across multimodal action understanding problems.
- Testing the modules on non-skeleton inputs such as RGB video could reveal whether the bias correction is modality-specific.
Load-bearing premise
That the spectral bias of diffusion models is the main bottleneck in zero-shot skeleton action recognition and that the three modules correct it without new errors or dataset-specific tuning.
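The premise is directly measurable: if diffusion reconstructions systematically carry a smaller fraction of their spectral energy at high temporal frequencies than the ground-truth skeletons, high-frequency dynamics are being oversmoothed. A minimal sketch of such a probe is below; the 25% cutoff and the moving-average stand-in for oversmoothing are illustrative assumptions.

```python
import numpy as np

def hf_energy(seq, cut=0.25):
    """Fraction of a sequence's spectral energy above a frequency
    cutoff (illustrative probe, not the paper's metric).

    `seq` is a skeleton sequence of shape (frames, joints, 3);
    the FFT runs along the time axis. A model whose outputs score
    consistently lower than the ground truth on this ratio is
    oversmoothing high-frequency motion.
    """
    spec = np.abs(np.fft.rfft(seq, axis=0)) ** 2
    cut_bin = int(spec.shape[0] * (1.0 - cut))
    return spec[cut_bin:].sum() / spec.sum()

def smooth(seq):
    """A 3-frame circular moving average: a crude stand-in for the
    oversmoothing attributed to spectral bias."""
    return (seq + np.roll(seq, 1, axis=0) + np.roll(seq, -1, axis=0)) / 3
```

Any low-pass operation, such as the moving average above, strictly shrinks this ratio, which is exactly the signature the load-bearing premise attributes to standard diffusion alignment.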
What would settle it
A direct comparison would settle it in the negative: either no measurable improvement in the high-frequency components of reconstructed skeletons, or failure to exceed prior zero-shot methods on the NTU RGB+D dataset, would refute the core claim.
read the original abstract
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM) for zero-shot skeleton action recognition. It integrates three modules—a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction—to counteract the spectral bias of diffusion models that oversmooths high-frequency motion details. The central claim is that the resulting method recovers fine-grained dynamics and reaches state-of-the-art performance on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton benchmarks, with code released at a public repository.
Significance. If the performance claims are substantiated by quantitative results, ablations, and error analysis, the work would offer a targeted improvement to diffusion-based zero-shot skeleton recognition by explicitly recovering high-frequency components. The public code release would further support reproducibility and extension by the community.
minor comments (1)
- Abstract: the claim of state-of-the-art performance is stated without any numerical metrics, baseline comparisons, or ablation results, making immediate assessment of the central empirical claim impossible from the provided text.
Simulated Author's Rebuttal
We thank the referee for reviewing our manuscript on Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM). We note the positive assessment of the work's potential significance, conditional on substantiating the performance claims, as well as the referee's uncertain recommendation. The manuscript provides quantitative results, ablations, and supporting analysis on the NTU RGB+D, PKU-MMD, and Kinetics-skeleton benchmarks, together with a public code release for reproducibility. No specific major comments were listed in the report.
Circularity Check
No significant circularity in derivation chain
full rationale
The abstract and available context describe a methodological proposal (FDSM with three modules) addressing an external known issue (spectral bias of diffusion models in ZSAR) via empirical integration and SOTA claims on standard datasets (NTU RGB+D, PKU-MMD, Kinetics-skeleton). No equations, derivations, predictions, or self-citations appear that reduce any result to its own inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations are present. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Reference graph
Works this paper leans on
- [1] Singh, R., Kushwaha, A.K.S., Srivastava, R.: Multi-view recognition system for human activity based on multiple features for video surveillance system. Multimedia Tools and Applications 78, 17165–17196 (2019)
- [2] Hong-qin, X., Yuan-yuan, Z.: Advanced gesture recognition method based on fractional fourier transform and relevance vector machine for smart home appliances. Computer Animation and Virtual Worlds 36(1), 70011 (2025)
- [3] Yang, Y., Zhou, J., Hu, W., Tu, Z.: End-to-end pose-action recognition via implicit pose encoding and multi-scale skeleton modeling. The Visual Computer, 1–17 (2025)
- [4] Aouaidjia, K., Sheng, B., Li, P., Kim, J., Feng, D.D.: Efficient body motion quantification and similarity evaluation using 3-d joints skeleton coordinates. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51(5), 2774–2788 (2019)
- [5] Hu, X., Bao, X., Wei, G., Li, Z.: Human-pose estimation based on weak supervision. Virtual Reality & Intelligent Hardware 5(4), 366–377 (2023)
- [6] Hou, Y., Li, Z., Wang, P., Li, W.: Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Transactions on Circuits and Systems for Video Technology 28(3), 807–811 (2016)
- [8] Qiu, Z.-X., Zhang, H.-B., Deng, W.-M., Du, J.-X., Lei, Q., Zhang, G.-L.: Effective skeleton topology and semantics-guided adaptive graph convolution network for action recognition. The Visual Computer 39(5), 2191–2203 (2023)
- [9] Liu, H., Liu, Y., Chen, Y., Yuan, C., Li, B., Hu, W.: Transkeleton: Hierarchical spatial-temporal transformer for skeleton-based action recognition. IEEE Transactions on Circuits and Systems for Video Technology (2023)
- [10] Wu, W., Zheng, C., Yang, Z., Chen, C., Das, S., Lu, A.: Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 4660–4669 (2024)
- [11] Zhao, J., Dai, J., Zhou, F., Pan, J., Xu, H.: Dual-path spatio-temporal mamba for skeleton-based action recognition. The Visual Computer, 1–13 (2025)
- [12] Xie, Z., Chen, J., Wang, Y., Xie, B.: Enhanced fine-grained relearning for skeleton-based action recognition. The Visual Computer, 1–13 (2025)
- [13] Tu, Z., Zhang, Z., Gong, J., Yuan, J., Du, B.: Informative sample selection model for skeleton-based action recognition with limited training samples. IEEE Transactions on Image Processing 34, 7335–7346 (2025)
- [14] Do, J., Kim, M.: Bridging the skeleton-text modality gap: Diffusion-powered modality alignment for zero-shot skeleton-based action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12757–12768 (2025)
- [15] Hubert Tsai, Y.-H., Huang, L.-K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580 (2017)
- [19] Chen, Y., Guo, J., He, T., Lu, X., Wang, L.: Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. In: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 778–786 (2024)
- [20] Zhu, A., Ke, Q., Gong, M., Bailey, J.: Part-aware unified representation of language and skeleton for zero-shot action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18761–18770 (2024)
- [21] Li, S.-W., Wei, Z.-X., Chen, W.-J., Yu, Y.-H., Yang, C.-Y., Hsu, J.Y.-j.: Sa-dvae: Improving zero-shot skeleton-based action recognition by disentangled variational autoencoders. In: European Conference on Computer Vision, pp. 447–462 (2025). Springer
- [22] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [23] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695 (2022)
- [24] Zhang, Z., Xu, L., Peng, D., Rahmani, H., Liu, J.: Diff-tracker: text-to-image diffusion models are unsupervised trackers. In: European Conference on Computer Vision, pp. 319–337 (2024). Springer
- [25] Wu, W., Guo, Z., Chen, C., Xue, H., Lu, A.: Frequency-semantic enhanced variational autoencoder for zero-shot skeleton-based action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11122–11131 (2025)
- [26] Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., Yan, S.: Inception transformer. Advances in Neural Information Processing Systems 35, 23495–23509 (2022)
- [27] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
- [28] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning, pp. 5301–5310 (2019). PMLR
- [29] Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: Ntu rgb+d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)
- [30] Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(10), 2684–2701 (2019)
- [31] Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: Pku-mmd: A large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475 (2017)
- [33] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
- [34] Schonfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., Akata, Z.: Generalized zero- and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8247–8255 (2019)
- [35] Gupta, P., Sharma, D., Sarvadevabhatla, R.K.: Syntactically guided generative embeddings for zero-shot skeleton action recognition. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 439–443 (2021). IEEE
- [36] Li, M.-Z., Jia, Z., Zhang, Z., Ma, Z., Wang, L.: Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition. In: International Conference on Image and Graphics, pp. 68–80 (2023). Springer
- [37] Li, S.-W., Wei, Z.-X., Chen, W.-J., Yu, Y.-H., Yang, C.-Y., Hsu, J.Y.-j.: Sa-dvae: Improving zero-shot skeleton-based action recognition by disentangled variational autoencoders. arXiv preprint arXiv:2407.13460 (2024)
- [38] Zhou, Y., Qiang, W., Rao, A., Lin, N., Su, B., Wang, J.: Zero-shot skeleton-based action recognition via mutual information estimation and maximization. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5302–5310 (2023)
- [39] Zhu, A., Ke, Q., Gong, M., Bailey, J.: Part-aware unified representation of language and skeleton for zero-shot action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18761–18770 (2024)
- [40] Chen, Y., Guo, J., He, T., Wang, L.: Fine-grained side information guided dual-prompts for zero-shot skeleton action recognition. arXiv preprint arXiv:2404.07487 (2024)
- [41] Kuang, J., Wang, H., Han, C., Gui, J.: Zero-shot skeleton-based action recognition with dual visual-text alignment. arXiv preprint arXiv:2409.14336 (2024)
- [42] Xu, H., Gao, Y., Li, J., Gao, X.: An information compensation framework for zero-shot skeleton-based action recognition. arXiv preprint arXiv:2406.00639 (2024)
- [43] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
- [44] Brown, T.B.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
- [45] Tevet, G., Raab, S., Abu-Horany, B., Cohen-Or, D.: Human motion diffusion model. In: International Conference on Learning Representations (ICLR) (2023)
- [46] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2024)
- [47] Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126 (2017)
- [48] Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal lstm with trust gates for 3d human action recognition. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pp. 816–833 (2016). Springer
- [49] Cai, D., Kang, Y., Yao, A., Chen, Y.: Ske2grid: Skeleton-to-grid representation learning for action recognition. In: International Conference on Machine Learning (2023)
- [50] Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978 (2022)
- [51] Chen, Y., Zhang, Z., Yuan, C., Li, B., Deng, Y., Hu, W.: Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13359–13368 (2021)
- [52] Chi, H.-g., Ha, M.H., Chi, S., Lee, S.W., Huang, Q., Ramani, K.: Infogcn: Representation learning for human skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20186–20196 (2022)
- [53] Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 143–152 (2020)
- [54] Zhou, Y., Yan, X., Cheng, Z.-Q., Yan, Y., Dai, Q., Hua, X.-S.: Blockgcn: Redefine topology awareness for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2049–2058 (2024)
- [55] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
- [56] Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192 (2020)
- [57] Pang, Y., Ke, Q., Rahmani, H., Bailey, J., Liu, J.: Igformer: Interaction graph transformer for skeleton-based human interaction recognition. In: European Conference on Computer Vision, pp. 605–622 (2022). Springer
- [58] Do, J., Kim, M.: Skateformer: Skeletal-temporal transformer for human action recognition. arXiv preprint arXiv:2403.09508 (2024)
- [59] Zhao, J., Ning, K., Zhou, F., Pan, J., Xu, H., Dai, J.: Multi-level fusion tokens for enhanced self-supervised skeleton-based action recognition. The Visual Computer 42(1), 37 (2026)
- [60] Sun, S., Jia, Z., Zhu, Y., Liu, G., Yu, Z.: Decoupled spatio-temporal grouping transformer for skeleton-based action recognition. The Visual Computer 40(8), 5733–5745 (2024)
- [61] Zhang, J., Xie, W., Wang, C., Tu, R., Tu, Z.: Graph-aware transformer for skeleton-based action recognition. The Visual Computer 39(10), 4501–4512 (2023)
- [62] Yao, J., Chen, J., Niu, L., Sheng, B.: Scene-aware human pose generation using transformer. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 2847–2855 (2023)
- [63] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [64] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
- [65] Peng, J., Liu, Z., Lin, J., He, G.: Precise motion inbetweening via bidirectional autoregressive diffusion models. Computer Animation and Virtual Worlds 36(3), 70040 (2025)
- [66] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241 (2015). Springer
- [67] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)
- [68] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., Qiao, Y.: Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534 (2022)
- [69] Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Transactions on Computers 100(1), 90–93 (1974)
- [70] Polesel, A., Ramponi, G., Mathews, V.J.: Image enhancement via adaptive unsharp masking. IEEE Transactions on Image Processing 9(3), 505–510 (2000)
- [71] Zhang, Z., Zhou, C., Tu, Z.: Distilling inter-class distance for semantic segmentation. arXiv preprint arXiv:2205.03650 (2022)
- [72] Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. International Journal of Computer Vision 129(6), 1789–1819 (2021)
- [73] Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., Yoon, S.: Perception prioritized training of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11472–11481 (2022)
- [74] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (2020)
- [75] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net (2017). https://openreview.net/forum?id=Skq89Scxx
- [76] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 41–48 (2009)
- [77] Hubert Tsai, Y.-H., Huang, L.-K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580 (2017)
- [78] Wray, M., Larlus, D., Csurka, G., Damen, D.: Fine-grained action retrieval through multiple parts-of-speech embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 450–459 (2019)
- [79] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
- [80] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [81] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
- [82] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [83] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
- [84] Ilharco, G., Wortsman, M., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (2021)
- [85] Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: Devise: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)