Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization
Pith reviewed 2026-05-10 11:21 UTC · model grok-4.3
The pith
A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, clustering at multiple levels while reconstructing the input skeletons and their timestamps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our hierarchical spatiotemporal vector quantization scheme performs multi-level clustering on skeleton data while simultaneously recovering input skeletons and their corresponding timestamps. This design outperforms non-hierarchical baselines, establishes new state-of-the-art performance in unsupervised skeleton-based temporal action segmentation, and reduces segment length bias.
What carries the argument
The hierarchical spatiotemporal vector quantization framework, which uses two consecutive levels of vector quantization to first associate skeletons with subactions and then aggregate them into actions, while reconstructing both poses and timestamps.
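The review gives no equations for the two consecutive VQ levels, so as a hedged illustration only, the core operation might look like the sketch below: frames are assigned to their nearest lower-level (subaction) code, and those codes are in turn assigned to a higher-level (action) code. All names, feature dimensions, and codebook sizes here are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def quantize(x, codebook):
    """Assign each row of x to its nearest codebook entry (L2 distance)."""
    # (N, 1, D) - (1, K, D) -> pairwise distances of shape (N, K)
    d = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 16))   # per-frame skeleton features (assumed dim 16)
sub_cb = rng.normal(size=(32, 16))    # lower-level codebook: fine-grained subactions
act_cb = rng.normal(size=(8, 16))     # higher-level codebook: action-level clusters

sub_codes, sub_idx = quantize(frames, sub_cb)     # frames -> subaction codes
act_codes, act_idx = quantize(sub_codes, act_cb)  # subaction codes -> action codes
```

In a trained model the codebooks would be learned by minimizing reconstruction error rather than drawn at random; the point of the sketch is only the two-stage assignment that turns per-frame features into a coarse action labeling.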
Load-bearing premise
That reconstruction of input skeletons and timestamps via hierarchical vector quantization will produce semantically meaningful subaction and action clusters without supervision or additional constraints.
What would settle it
If the segments produced by the method on HuGaDB, LARa, or BABEL show boundary alignment with ground-truth annotations no better than a standard single-level clustering baseline, or if segment-length bias remains unchanged after the hierarchical extension.
Original abstract
We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. It introduces two consecutive VQ levels where the lower level associates skeletons with fine-grained subactions via spatial reconstruction and the higher level aggregates them into action representations; the spatiotemporal extension additionally recovers timestamps. Experiments on HuGaDB, LARa, and BABEL are claimed to establish new state-of-the-art performance while reducing segment length bias.
Significance. If the empirical results are robust and the learned codes align with semantic subactions/actions rather than low-level features, the work would advance unsupervised skeleton-based segmentation by demonstrating that hierarchical VQ reconstruction can perform multi-level clustering without labels or explicit boundary terms, potentially offering a scalable alternative to supervised methods and mitigating common biases in temporal segmentation.
Major comments (2)
- [Method] Method description (hierarchical VQ and spatiotemporal extension): The central claim that reconstruction of input skeletons (and timestamps) at two VQ levels automatically yields semantically meaningful subaction and action clusters rests on an unstated assumption. The objective minimizes only reconstruction error with no explicit terms for temporal consistency, boundary detection, or semantic alignment; nothing prevents clustering by static pose similarity or frame-rate artifacts. This directly undermines the reported SOTA and bias-reduction claims.
- [Experiments] Experiments section: The abstract asserts SOTA results and segment-length-bias reduction on HuGaDB, LARa, and BABEL, yet the manuscript summary provides no quantitative metrics, baseline details, ablation studies, or error analysis. Without these load-bearing elements, the magnitude of improvement and the validity of the unsupervised clustering cannot be assessed.
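To make the first objection concrete: a reconstruction-only objective of the kind the referee describes contains no term that rewards temporal consistency or boundary placement. The sketch below is in the style of a VQ-VAE loss, with pose and timestamp reconstruction terms added; the weighting `beta` and the omission of stop-gradients are simplifications assumed for illustration, not the paper's formulation.

```python
import numpy as np

def vq_recon_loss(x, x_hat, t, t_hat, z_e, e, beta=0.25):
    """Pose + timestamp reconstruction plus codebook/commitment terms.
    Nothing here enforces temporal coherence or boundary detection."""
    pose_recon = np.mean((x - x_hat) ** 2)   # skeleton reconstruction
    time_recon = np.mean((t - t_hat) ** 2)   # timestamp reconstruction
    # In a real model the next two terms differ only in stop-gradient
    # placement; with plain numpy they collapse to the same expression.
    codebook = np.mean((z_e - e) ** 2)
    commitment = beta * np.mean((z_e - e) ** 2)
    return pose_recon + time_recon + codebook + commitment

# Perfect reconstruction with matching codes drives the loss to zero.
x = np.ones((4, 3)); t = np.zeros(4); z = np.ones((4, 2))
loss = vq_recon_loss(x, x, t, t, z, z)
```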
Minor comments (1)
- Clarify notation for the two VQ levels and the timestamp reconstruction term to avoid ambiguity in the hierarchical formulation.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Method] Method description (hierarchical VQ and spatiotemporal extension): The central claim that reconstruction of input skeletons (and timestamps) at two VQ levels automatically yields semantically meaningful subaction and action clusters rests on an unstated assumption. The objective minimizes only reconstruction error with no explicit terms for temporal consistency, boundary detection, or semantic alignment; nothing prevents clustering by static pose similarity or frame-rate artifacts. This directly undermines the reported SOTA and bias-reduction claims.
Authors: We acknowledge that the training objective consists solely of reconstruction losses at both VQ levels without explicit regularization for temporal consistency, boundary detection, or semantic alignment. The hierarchical design is intended to induce multi-scale clustering implicitly: the lower VQ level reconstructs fine-grained spatial variations that correspond to subactions, while the higher level aggregates these into action-level representations. The spatiotemporal extension further incorporates timestamp reconstruction to encourage temporally coherent codes. We agree that this mechanism is implicit and that additional analysis is warranted to rule out clustering by static pose or frame-rate artifacts. In the revised manuscript we will (1) explicitly articulate the underlying assumptions in the method section, (2) add visualizations (e.g., t-SNE of code assignments colored by ground-truth labels) and quantitative checks (e.g., mutual information between codes and semantic labels versus low-level features) to demonstrate semantic alignment, and (3) discuss potential limitations of the reconstruction-only objective. These additions will support rather than undermine the empirical claims.
Revision: yes
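The mutual-information check the authors propose can be sketched minimally: compute the empirical mutual information between discrete code assignments and ground-truth labels, and compare aligned codes against shuffled ones. This is a toy illustration on synthetic assignments, not the paper's evaluation code.

```python
import numpy as np

def mutual_information(codes, labels):
    """Empirical mutual information (in nats) between two discrete
    labelings, estimated from their joint frequency table."""
    codes, labels = np.asarray(codes), np.asarray(labels)
    mi = 0.0
    for c in np.unique(codes):
        for l in np.unique(labels):
            p_cl = np.mean((codes == c) & (labels == l))
            if p_cl > 0:
                p_c = np.mean(codes == c)
                p_l = np.mean(labels == l)
                mi += p_cl * np.log(p_cl / (p_c * p_l))
    return mi

# Codes that track the labels carry information; independent codes do not.
aligned = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
shuffled = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
```

A high score for learned codes against semantic labels, but not against low-level features such as static pose clusters, would support the semantic-alignment claim.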
Referee: [Experiments] Experiments section: The abstract asserts SOTA results and segment-length-bias reduction on HuGaDB, LARa, and BABEL, yet the manuscript summary provides no quantitative metrics, baseline details, ablation studies, or error analysis. Without these load-bearing elements, the magnitude of improvement and the validity of the unsupervised clustering cannot be assessed.
Authors: The full manuscript contains a dedicated Experiments section that reports quantitative metrics, comparisons against multiple unsupervised baselines, ablation studies on the hierarchical and spatiotemporal components, and analysis of segment-length bias across HuGaDB, LARa, and BABEL. However, we recognize that the initial submission summary did not sufficiently foreground these results. In the revision we will (1) include a concise summary table of key metrics in the introduction or abstract, (2) expand the error analysis with per-dataset breakdowns and statistical significance tests, and (3) ensure all baseline details and ablation configurations are clearly tabulated. These changes will make the magnitude of improvement and the validity of the clustering directly verifiable.
Revision: partial
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper defines an unsupervised hierarchical spatiotemporal VQ model whose objective is explicit reconstruction of input skeletons plus timestamps at two levels. Clustering emerges as a byproduct of discretization for reconstruction fidelity, but the claimed SOTA performance and segment-length-bias reduction are reported as outcomes of separate empirical evaluation on HuGaDB, LARa, and BABEL. No equations equate the learned codes to semantic subactions by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations appear in the provided derivation chain. The semantic-alignment assumption is therefore an independent empirical claim rather than a definitional tautology.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Vector quantization can map skeleton poses to discrete subaction codes that aggregate meaningfully into actions.
- Domain assumption: Reconstruction loss on skeletons and timestamps suffices to learn segmentation boundaries.