pith. machine review for the scientific record.

arXiv: 2604.15196 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords unsupervised action segmentation · skeleton-based actions · vector quantization · hierarchical clustering · temporal segmentation · spatiotemporal analysis · subaction discovery

The pith

A hierarchical spatiotemporal vector quantization framework segments skeleton-based actions without supervision, performing multi-level clustering while reconstructing the input data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a hierarchical spatiotemporal vector quantization method for unsupervised temporal action segmentation from skeleton sequences. It applies vector quantization in two successive stages, with the lower stage mapping individual skeletons to fine-grained subactions and the upper stage grouping those into full actions. The process simultaneously reconstructs both the original skeleton poses and their timestamps, allowing the model to discover structure from spatial and temporal cues alone. This design outperforms a non-hierarchical baseline and produces segments with less length bias across the HuGaDB, LARa, and BABEL benchmarks.
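To make the two-stage assignment concrete, here is a minimal sketch in NumPy. The prototype counts, dimensions, and random data are invented for illustration; the paper's encoder, temporal patching, and training procedure are omitted.

    import numpy as np

    # Toy stand-ins: 200 encoded frames, 16 subaction prototypes, 4 action
    # prototypes. All counts and dimensions are illustrative, not the paper's.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(200, 64))           # encoded skeleton frames
    subaction_protos = rng.normal(size=(16, 64))  # lower-level VQ codebook
    action_protos = rng.normal(size=(4, 64))      # higher-level VQ codebook

    def nearest(x, codebook):
        """Index of the nearest prototype (Euclidean) for each row of x."""
        d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    # Level 1 maps frames to subaction codes; level 2 maps the selected
    # subaction prototypes to action codes, giving frame -> subaction -> action.
    sub_codes = nearest(frames, subaction_protos)
    act_codes = nearest(subaction_protos[sub_codes], action_protos)
    print(sub_codes[:10], act_codes[:10])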

Core claim

Our hierarchical spatiotemporal vector quantization scheme performs multi-level clustering on skeleton data while simultaneously recovering input skeletons and their corresponding timestamps; this allows it to outperform non-hierarchical baselines, establish new state-of-the-art performance in unsupervised skeleton-based temporal action segmentation, and reduce segment length bias.

What carries the argument

The hierarchical spatiotemporal vector quantization framework, which applies two consecutive levels of vector quantization to first associate skeletons with subactions and then aggregate those into actions, all while reconstructing both poses and timestamps.

Load-bearing premise

That reconstruction of input skeletons and timestamps via hierarchical vector quantization will produce semantically meaningful subaction and action clusters without supervision or additional constraints.

What would settle it

If the segments produced by the method on HuGaDB, LARa, or BABEL show boundary alignment with ground-truth annotations no better than a standard single-level clustering baseline, or if segment-length bias remains unchanged after the hierarchical extension.
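The segment-length half of that test is mechanical to run. A hypothetical evaluation helper (the paper's own metrics and protocol may differ):

    import numpy as np

    def segment_lengths(framewise_labels):
        """Lengths of maximal constant-label runs in a framewise label sequence."""
        labels = np.asarray(framewise_labels)
        boundaries = np.flatnonzero(labels[1:] != labels[:-1]) + 1
        starts = np.concatenate(([0], boundaries))
        ends = np.concatenate((boundaries, [len(labels)]))
        return ends - starts

    # Toy sequences; real use would pass ground-truth and predicted framewise
    # labels for each video in HuGaDB, LARa, or BABEL.
    gt   = [0] * 50 + [1] * 10 + [2] * 40
    pred = [0] * 33 + [1] * 33 + [2] * 34  # length-biased: near-uniform segments

    gt_len, pred_len = segment_lengths(gt), segment_lengths(pred)
    # A crude bias indicator: predicted lengths hugging their mean while the
    # ground-truth lengths vary (compare coefficients of variation).
    print(gt_len.std() / gt_len.mean(), pred_len.std() / pred_len.mean())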

Figures

Figures reproduced from arXiv: 2604.15196 by Fawad Javed Fateh, M. Shaheer Luqman, M. Zeeshan Zia, Quoc-Huy Tran, Syed Ahmed Mahmood, Umer Ahmed.

Figure 1
Figure 1: (a) Previous unsupervised skeleton-based temporal action segmentation methods, e.g., SMQ [17], rely on traditional vector quantization techniques, which perform flat clustering and mostly exploit spatial cues via reconstructing input skeletons. (b) We propose a hierarchical spatiotemporal vector quantization approach, which conducts multi-level clustering and exploits both spatial and temporal cues by jointly …
Figure 2
Figure 2: Given an input skeleton sequence S with associated timestamps T, we pass S to an encoder, which maps each joint sequence independently to the latent space. We concatenate the embedded joint sequences into the embedded skeleton sequence and divide it into patches along the temporal dimension, yielding X_P. Next, each patch p_k is first assigned to the nearest subaction prototype z_j, which is then assigned …
Figure 3
Figure 3: Histograms of segment lengths (GT, Ours, SMQ) on (a) HuGaDB, (b) LARa, (c) BABEL Subset-2, and (d) BABEL Subset-3.
Figure 4
Figure 4: Qualitative comparisons on HuGaDB, LARa, BABEL Subset-2, and BABEL Subset-3 … across all examples. This highlights the superior accuracy and reduced segment length bias of our HiST-VQ approach compared to the state-of-the-art model SMQ [17].
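The temporal patching described in the Figure 2 caption can be illustrated with a toy sketch; the shapes, joint count, and patch length below are invented, not taken from the paper.

    import numpy as np

    T, J, D = 120, 25, 8                 # frames, joints, per-joint embedding dim
    emb = np.random.randn(T, J, D)       # per-joint embeddings from the encoder
    X = emb.reshape(T, J * D)            # concatenate joints -> embedded skeletons
    P = 10                               # patch length along the temporal axis
    X_P = X[: (T // P) * P].reshape(T // P, P, J * D)  # non-overlapping patches
    print(X_P.shape)                     # (12, 10, 200)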
original abstract

We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
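Read as pseudocode, the abstract's objective is reconstruction at two VQ levels plus timestamp recovery. A minimal PyTorch sketch under that reading follows; the codebook sizes, dimensions, and unweighted loss terms are assumptions, and the paper's joint-wise encoder and temporal patching are omitted.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    T, D = 128, 32                     # frames, latent dimension (illustrative)
    K_sub, K_act = 16, 4               # subaction / action codebook sizes

    enc = torch.nn.Linear(75, D)       # e.g. 25 joints x 3 coords -> latent
    dec_pose = torch.nn.Linear(D, 75)  # reconstruct skeleton coordinates
    dec_time = torch.nn.Linear(D, 1)   # reconstruct normalized timestamps
    sub_book = torch.nn.Embedding(K_sub, D)
    act_book = torch.nn.Embedding(K_act, D)

    def quantize(z, book):
        """Nearest-codebook lookup with a straight-through gradient."""
        idx = torch.cdist(z, book.weight).argmin(dim=-1)
        q = book(idx)
        commit = F.mse_loss(z, q.detach()) + F.mse_loss(z.detach(), q)
        return z + (q - z).detach(), commit, idx

    skel = torch.randn(T, 75)                    # toy skeleton sequence
    ts = torch.linspace(0, 1, T).unsqueeze(1)    # timestamps in [0, 1]

    z = enc(skel)
    q_sub, c_sub, _ = quantize(z, sub_book)      # level 1: skeletons -> subactions
    q_act, c_act, _ = quantize(q_sub, act_book)  # level 2: subactions -> actions

    loss = (F.mse_loss(dec_pose(q_sub), skel)    # spatial cue
            + F.mse_loss(dec_time(q_sub), ts)    # temporal cue
            + c_sub + c_act)                     # VQ commitment terms
    loss.backward()

Note that nothing in this objective references action boundaries or semantics, which is exactly the gap the referee report below presses on.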

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. It introduces two consecutive VQ levels where the lower level associates skeletons with fine-grained subactions via spatial reconstruction and the higher level aggregates them into action representations; the spatiotemporal extension additionally recovers timestamps. Experiments on HuGaDB, LARa, and BABEL are claimed to establish new state-of-the-art performance while reducing segment length bias.

Significance. If the empirical results are robust and the learned codes align with semantic subactions/actions rather than low-level features, the work would advance unsupervised skeleton-based segmentation by demonstrating that hierarchical VQ reconstruction can perform multi-level clustering without labels or explicit boundary terms, potentially offering a scalable alternative to supervised methods and mitigating common biases in temporal segmentation.

major comments (2)
  1. [Method] Method description (hierarchical VQ and spatiotemporal extension): The central claim that reconstruction of input skeletons (and timestamps) at two VQ levels automatically yields semantically meaningful subaction and action clusters rests on an unstated assumption. The objective minimizes only reconstruction error with no explicit terms for temporal consistency, boundary detection, or semantic alignment; nothing prevents clustering by static pose similarity or frame-rate artifacts. This directly undermines the reported SOTA and bias-reduction claims.
  2. [Experiments] Experiments section: The abstract asserts SOTA results and segment-length-bias reduction on HuGaDB, LARa, and BABEL, yet the manuscript summary provides no quantitative metrics, baseline details, ablation studies, or error analysis. Without these load-bearing elements, the magnitude of improvement and the validity of the unsupervised clustering cannot be assessed.
minor comments (1)
  1. Clarify notation for the two VQ levels and the timestamp reconstruction term to avoid ambiguity in the hierarchical formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [Method] Method description (hierarchical VQ and spatiotemporal extension): The central claim that reconstruction of input skeletons (and timestamps) at two VQ levels automatically yields semantically meaningful subaction and action clusters rests on an unstated assumption. The objective minimizes only reconstruction error with no explicit terms for temporal consistency, boundary detection, or semantic alignment; nothing prevents clustering by static pose similarity or frame-rate artifacts. This directly undermines the reported SOTA and bias-reduction claims.

    Authors: We acknowledge that the training objective consists solely of reconstruction losses at both VQ levels without explicit regularization for temporal consistency, boundary detection, or semantic alignment. The hierarchical design is intended to induce multi-scale clustering implicitly: the lower VQ level reconstructs fine-grained spatial variations that correspond to subactions, while the higher level aggregates these into action-level representations. The spatiotemporal extension further incorporates timestamp reconstruction to encourage temporally coherent codes. We agree that this mechanism is implicit and that additional analysis is warranted to rule out clustering by static pose or frame-rate artifacts. In the revised manuscript we will (1) explicitly articulate the underlying assumptions in the method section, (2) add visualizations (e.g., t-SNE of code assignments colored by ground-truth labels) and quantitative checks (e.g., mutual information between codes and semantic labels versus low-level features; a toy sketch of this check appears after these responses) to demonstrate semantic alignment, and (3) discuss potential limitations of the reconstruction-only objective. These additions will support rather than undermine the empirical claims. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts SOTA results and segment-length-bias reduction on HuGaDB, LARa, and BABEL, yet the manuscript summary provides no quantitative metrics, baseline details, ablation studies, or error analysis. Without these load-bearing elements, the magnitude of improvement and the validity of the unsupervised clustering cannot be assessed.

    Authors: The full manuscript contains a dedicated Experiments section that reports quantitative metrics, comparisons against multiple unsupervised baselines, ablation studies on the hierarchical and spatiotemporal components, and analysis of segment-length bias across HuGaDB, LARa, and BABEL. However, we recognize that the initial submission summary did not sufficiently foreground these results. In the revision we will (1) include a concise summary table of key metrics in the introduction or abstract, (2) expand the error analysis with per-dataset breakdowns and statistical significance tests, and (3) ensure all baseline details and ablation configurations are clearly tabulated. These changes will make the magnitude of improvement and the validity of the clustering directly verifiable. revision: partial
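The mutual-information check proposed in response 1 could look like the following sketch, using scikit-learn's normalized_mutual_info_score on toy placeholder arrays standing in for framewise labels, code assignments, and a low-level feature.

    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score

    rng = np.random.default_rng(0)
    gt_labels = np.repeat([0, 1, 2, 3], 50)  # framewise ground truth (toy)
    codes = np.repeat([2, 0, 3, 1], 50)      # learned code per frame (toy)
    codes[rng.integers(0, 200, 20)] = rng.integers(0, 4, 20)  # label noise

    low_level = (rng.normal(size=200) > 0).astype(int)  # e.g. a raw-pose split

    # Codes that track semantics should beat the low-level baseline by a wide
    # margin; comparable scores would support the referee's concern.
    print("NMI(codes, labels):   ", normalized_mutual_info_score(gt_labels, codes))
    print("NMI(low-level, labels):", normalized_mutual_info_score(gt_labels, low_level))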

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines an unsupervised hierarchical spatiotemporal VQ model whose objective is explicit reconstruction of input skeletons plus timestamps at two levels. Clustering emerges as a byproduct of discretization for reconstruction fidelity, but the claimed SOTA performance and segment-length-bias reduction are reported as outcomes of separate empirical evaluation on HuGaDB, LARa, and BABEL. No equations equate the learned codes to semantic subactions by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations appear in the provided derivation chain. The semantic-alignment assumption is therefore an independent empirical claim rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard unsupervised clustering assumptions and reconstruction objectives; no new entities or heavily fitted parameters are introduced in the abstract.

axioms (2)
  • domain assumption Vector quantization can map skeleton poses to discrete subaction codes that aggregate meaningfully into actions
    Invoked by the lower- and higher-level VQ stages described in the abstract.
  • domain assumption Reconstruction loss on skeletons and timestamps suffices to learn segmentation boundaries
    Central to both the spatial-only and spatiotemporal variants.

pith-pipeline@v0.9.0 · 5469 in / 1218 out tokens · 28958 ms · 2026-05-10T11:21:22.699694+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1] Alayrac, J.B., Bojanowski, P., Agrawal, N., Sivic, J., Laptev, I., Lacoste-Julien, S.: Unsupervised learning from narrated instruction videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4575–4583 (2016)
  2. [2] Ali, A.S., Mahmood, S.A., Saeed, M., Konin, A., Zia, M.Z., Tran, Q.H.: Joint self-supervised video alignment and action segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10807–10818 (2025)
  3. [3] Bueno-Benito, E., Dimiccoli, M.: CLOT: Closed loop optimal transport for unsupervised action segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10719–10729 (2025)
  4. [4] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 132–149 (2018)
  5. [5] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, 9912–9924 (2020)
  6. [6] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
  7. [7] Chen, E., Wang, X., Guo, X.: Masked reconstruction model of latent space vector quantization for human skeleton-based action recognition. Neurocomputing, 132126 (2025)
  8. [8] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. pp. 1597–1607. PMLR (2020)
  9. [9] Chen, Y., Ge, Y., Tang, W., Li, Y., Ge, Y., Ding, M., Shan, Y., Liu, X.: Moto: Latent motion token as the bridging language for learning robot manipulation from videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19752–19763 (2025)
  10. [10] Chereshnev, R., Kertész-Farkas, A.: HuGaDB: Human gait database for activity recognition from wearable inertial sensor networks. In: International Conference on Analysis of Images, Social Networks and Texts. pp. 131–141. Springer (2017)
  11. [11] Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020)
  12. [12] Ding, G., Sener, F., Yao, A.: Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(2), 1011–1030 (2023)
  13. [13] Ding, L., Xu, C.: TricorNet: A hybrid temporal convolutional and recurrent network for video action segmentation. arXiv preprint (2017)
  14. [14] Farha, Y.A., Gall, J.: MS-TCN: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3575–3584 (2019)
  15. [15] Feichtenhofer, C., Li, Y., He, K., et al.: Masked autoencoders as spatiotemporal learners. Advances in Neural Information Processing Systems 35, 35946–35958 (2022)
  16. [16] Filtjens, B., Vanrumste, B., Slaets, P.: Skeleton-based action segmentation with multi-stage spatial-temporal graph convolutional neural networks. IEEE Transactions on Emerging Topics in Computing 12(1), 202–212 (2022)
  17. [17] Gökay, U., Spurio, F., Bach, D.R., Gall, J.: Skeleton motion words for unsupervised skeleton-based temporal action segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12101–12111 (2025)
  18. [18] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  19. [19] Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 762–770 (2022)
  20. [20] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2022)
  21. [21] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738 (2020)
  22. [22] Hosseini, B., Montagne, R., Hammer, B.: Deep-aligned convolutional neural network for skeleton-based action recognition and segmentation. Data Science and Engineering 5(2), 126–139 (2020)
  23. [23] Hyder, S.W., Usama, M., Zafar, A., Naufil, M., Fateh, F.J., Konin, A., Zia, M.Z., Tran, Q.H.: Action segmentation using 2D skeleton heatmaps and multi-modality fusion. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 1048–1055. IEEE (2024)
  24. [24] Ji, H., Chen, B., Xu, X., Ren, W., Wang, Z., Liu, H.: Language-assisted skeleton action understanding for skeleton-based temporal action segmentation. In: European Conference on Computer Vision. pp. 400–417. Springer (2024)
  25. [25] Khan, H., Haresh, S., Ahmed, A., Siddiqui, S., Konin, A., Zia, M.Z., Tran, Q.H.: Timestamp-supervised action segmentation with graph convolutional networks. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 10619–10626. IEEE (2022)
  26. [26] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  27. [27] Kuehne, H., Richard, A., Gall, J.: A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4), 765–779 (2018)
  28. [28] Kukleva, A., Kuehne, H., Sener, F., Gall, J.: Unsupervised learning of action classes with continuous temporal embedding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12066–12074 (2019)
  29. [29] Kumar, S., Haresh, S., Ahmed, A., Konin, A., Zia, M.Z., Tran, Q.H.: Unsupervised action segmentation by joint representation learning and online clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20174–20185 (2022)
  30. [30] Kwon, T., Tekin, B., Tang, S., Pollefeys, M.: Context-aware sequence alignment using 4D skeletal augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8172–8182 (2022)
  31. [31] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 156–165 (2017)
  32. [32] Lei, P., Todorovic, S.: Temporal deformable residual networks for action segmentation in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6742–6751 (2018)
  33. [33] Li, J., Todorovic, S.: Action shuffle alternating learning for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12628–12636 (2021)
  34. [34] Li, L., Wang, M., Ni, B., Wang, H., Yang, J., Zhang, W.: 3D human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4741–4750 (2021)
  35. [35] Li, Y.H., Liu, K.Y., Liu, S.L., Feng, L., Qiao, H.: Involving distinguished temporal graph convolutional networks for skeleton-based temporal action segmentation. IEEE Transactions on Circuits and Systems for Video Technology 34(1), 647–660 (2023)
  36. [36] Li, Y., Li, Z., Gao, S., Wang, Q., Hou, Q., Cheng, M.M.: A decoupled spatio-temporal framework for skeleton-based action segmentation. arXiv preprint arXiv:2312.05830 (2023)
  37. [37] Lin, L., Song, S., Yang, W., Liu, J.: MS2L: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 2490–2498 (2020)
  38. [38] Lin, L., Zhang, J., Liu, J.: Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2363–2372 (2023)
  39. [39] Liu, K., Li, Y., Xu, Y., Liu, S., Liu, S.: Spatial focus attention for fine-grained skeleton-based action tasks. IEEE Signal Processing Letters 29, 1883–1887 (2022)
  40. [40] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5442–5451 (2019)
  41. [41] Mahmood, S.A., Ali, A.S., Ahmed, U., Fateh, F.J., Zia, M.Z., Tran, Q.H.: Procedure learning via regularized Gromov-Wasserstein optimal transport. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6925–6935 (2026)
  42. [42] Niemann, F., Reining, C., Moya Rueda, F., Nair, N.R., Steffens, J.A., Fink, G.A., Ten Hompel, M.: LARa: Creating a dataset for human activity recognition in logistics using semantic attributes. Sensors 20(15), 4083 (2020)
  43. [43] Paoletti, G., Cavazza, J., Beyan, C., Del Bue, A.: Unsupervised human action recognition with skeletal graph Laplacian and self-supervised viewpoints invariance. arXiv preprint arXiv:2204.10312 (2022)
  44. [44] Parsa, B., Banerjee, A.G.: A multi-task learning approach for human activity segmentation and ergonomics risk assessment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2352–2362 (2021)
  45. [45] Parsa, B., Dariush, B., et al.: Spatio-temporal pyramid graph convolutions for human action recognition and postural assessment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1080–1090 (2020)
  46. [46] Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: Bodies, action and behavior with English labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 722–731 (2021)
  47. [47] Razavi, A., Van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems 32 (2019)
  48. [48] Sener, F., Yao, A.: Unsupervised learning and segmentation of complex activities from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8368–8376 (2018)
  49. [49] Sener, O., Zamir, A.R., Savarese, S., Saxena, A.: Unsupervised semantic parsing of video collections. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4480–4488 (2015)
  50. [50] Spurio, F., Bahrami, E., Francesca, G., Gall, J.: Hierarchical vector quantization for unsupervised action segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 6996–7005 (2025)
  51. [51] Stoffl, L., Bonnetto, A., d'Ascoli, S., Mathis, A.: Elucidating the hierarchical nature of behavior with masked autoencoders. In: European Conference on Computer Vision. pp. 106–
  52. [52] Su, K., Liu, X., Shlizerman, E.: Predict & cluster: Unsupervised skeleton based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9631–9640 (2020)
  53. [53] Tan, C., Sun, T., Fu, T., Wang, Y., Xu, M., Liu, S.: Hierarchical spatial-temporal network for skeleton-based temporal action segmentation. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 28–39. Springer (2023)
  54. [54] Tian, X., Jin, Y., Zhang, Z., Liu, P., Tang, X.: STGA-Net: Spatial-temporal graph attention network for skeleton-based temporal action segmentation. In: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). pp. 218–223. IEEE (2023)
  55. [55] Tian, X., Jin, Y., Zhang, Z., Liu, P., Tang, X.: Spatial-temporal graph transformer network for skeleton-based temporal action segmentation. Multimedia Tools and Applications 83(15), 44273–44297 (2024)
  56. [56] Tran, Q.H., Ahmed, M., Popattia, M., Ahmed, M.H., Konin, A., Zia, M.Z.: Learning by aligning 2D skeleton sequences and multi-modality fusion. In: European Conference on Computer Vision. pp. 141–161. Springer (2024)
  57. [57] Tran, Q.H., Mehmood, A., Ahmed, M., Naufil, M., Zafar, A., Konin, A., Zia, Z.: Permutation-aware activity segmentation via unsupervised frame-to-segment alignment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6426–6436 (2024)
  58. [58] Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017)
  59. [59] VidalMata, R.G., Scheirer, W.J., Kukleva, A., Cox, D., Kuehne, H.: Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1238–1247 (2021)
  60. [60] Vuong, A.D., Vu, M.N., An, D., Reid, I.: Action tokenizer matters in in-context imitation learning. In: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 13490–13496. IEEE (2025)
  61. [61] Xu, L., Wang, Q., Lin, X., Yuan, L.: An efficient framework for few-shot skeleton-based temporal action segmentation. Computer Vision and Image Understanding 232, 103707 (2023)
  62. [62] Xu, M., Gould, S.: Temporally consistent unbalanced optimal transport for unsupervised action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14618–14627 (2024)
  63. [63] Xu, Z., Shen, X., Wong, Y., Kankanhalli, M.S.: Unsupervised motion representation learning with capsule autoencoders. Advances in Neural Information Processing Systems 34, 3205–3217 (2021)
  64. [64] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
  65. [65] Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157 (2021)
  66. [66] Yang, D., Wang, Y., Dantcheva, A., Kong, Q., Garattoni, L., Francesca, G., Bremond, F.: LAC: Latent action composition for skeleton-based action segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13679–13690 (2023)
  67. [67] Yu, Q., Fujiwara, K.: Frame-level label refinement for skeleton-based weakly-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 3322–3330 (2023)
  68. [68] Zhang, H., Hou, Y., Zhang, W., Li, W.: Contrastive positive mining for unsupervised 3D action representation learning. In: European Conference on Computer Vision. pp. 36–51. Springer (2022)
  69. [69] Zheng, N., Wen, J., Liu, R., Long, L., Dai, J., Gong, Z.: Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)