pith. machine review for the scientific record.

arxiv: 2604.27508 · v1 · submitted 2026-04-30 · 💻 cs.RO

Recognition: unknown

SASI: Leveraging Sub-Action Semantics for Robust Early Action Recognition in Human-Robot Interaction

Hyuno Kim, Masahiro Hirano, Yongpeng Cao, Yuji Yamakawa

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:24 UTC · model grok-4.3

classification 💻 cs.RO
keywords early action recognition · sub-action semantics · human-robot interaction · graph convolution networks · skeleton data · BABEL dataset · cross-modal fusion · real-time recognition

The pith

SASI fuses sub-action semantics with skeleton graph features to raise early action recognition accuracy on partial sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SASI to let robots recognize human actions sooner from incomplete observations by exploiting the natural breakdown of actions into smaller, meaningful sub-actions. It combines a sub-action segmentation model with existing graph convolution networks so that fine-grained semantic cues from the sub-actions fuse with the overall spatial movement patterns carried by skeleton data. The system runs in real time and is evaluated on the BABEL dataset with frame-level labels, where it outperforms standard holistic approaches, especially when only part of an action has been observed. A reader would care because quicker understanding supports the proactive robot responses that make human-robot collaboration smoother and safer. The authors expect the gains to grow as segmentation quality improves.
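To make the mechanism concrete, here is a minimal, hypothetical sketch of the kind of cross-modal fusion the page describes: per-frame sub-action embeddings from a segmentation model are attended against per-frame skeleton GCN features, with a residual path preserving the spatial context. The module name, feature dimensions, and the use of cross-attention are illustrative assumptions, not the authors' implementation.

    # Hypothetical cross-modal fusion sketch (not the authors' code).
    # Assumes a skeleton GCN backbone producing per-frame features and a sub-action
    # segmentation model producing per-frame semantic embeddings; all sizes are illustrative.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, skel_dim=256, sem_dim=128, n_heads=4, n_classes=60):
            super().__init__()
            self.sem_proj = nn.Linear(sem_dim, skel_dim)   # align sub-action embeddings with GCN width
            self.attn = nn.MultiheadAttention(skel_dim, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(skel_dim)
            self.head = nn.Linear(skel_dim, n_classes)     # holistic action classifier

        def forward(self, skel_feats, sem_feats):
            # skel_feats: (B, T, skel_dim) spatiotemporal features from the skeleton GCN
            # sem_feats:  (B, T, sem_dim) per-frame sub-action embeddings from the segmentation model
            sem = self.sem_proj(sem_feats)
            # skeleton features query the sub-action stream for semantic context
            fused, _ = self.attn(query=skel_feats, key=sem, value=sem)
            fused = self.norm(skel_feats + fused)          # residual keeps overall spatial context
            return self.head(fused.mean(dim=1))            # temporal pooling -> action logits

    # Example with random tensors standing in for real features:
    #   logits = CrossModalFusion()(torch.randn(2, 48, 256), torch.randn(2, 48, 128))  # shape (2, 60)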

Core claim

SASI integrates graph convolution networks to fuse spatiotemporal features with sub-action semantics captured by a segmentation model on skeleton input. It retains both the detailed cues from sub-action units and the broader spatial context while processing at 29 Hz. On the BABEL dataset, the approach yields higher recognition accuracy than conventional methods and shows stronger results on partial action sequences, demonstrating its utility for early recognition needed in proactive human-robot interaction.
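The 29 Hz figure is a throughput claim. A generic way to check whether a trained model meets such a budget is to time repeated forward passes and compare against the 1/29 s period; the sketch below is a measurement convention assumed here, not the paper's protocol, and the model, input shapes, and iteration counts are placeholders.

    # Generic inference-rate check against a 29 Hz budget (illustrative only).
    import time
    import torch

    def measure_hz(model, example_inputs, warmup=10, iters=100):
        model.eval()
        with torch.no_grad():
            for _ in range(warmup):            # warm-up passes before timing
                model(*example_inputs)
            start = time.perf_counter()
            for _ in range(iters):
                model(*example_inputs)         # on GPU, call torch.cuda.synchronize() before reading the clock
            elapsed = time.perf_counter() - start
        return iters / elapsed                 # average forward passes per second

    # Usage with the hypothetical fusion sketch above:
    #   hz = measure_hz(CrossModalFusion(), (torch.randn(1, 48, 256), torch.randn(1, 48, 128)))
    #   print(f"{hz:.1f} Hz", "meets 29 Hz" if hz >= 29 else "below budget")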

What carries the argument

SASI's cross-modal fusion, which pairs sub-action segmentation semantics with skeleton-based graph convolution networks

Load-bearing premise

Sub-action segmentation supplies reliable semantic cues that improve fusion with spatiotemporal features without adding noise or needing extra tuning to produce the observed gains.

What would settle it

An ablation test on BABEL showing that SASI without the sub-action segmentation module matches or exceeds full SASI accuracy on partial sequences would falsify the contribution of the semantic integration.
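A minimal way to score that test, under assumed interfaces (a predict function returning an action label from a partial skeleton sequence, and a dataset of (frames, label) pairs, neither taken from the paper): evaluate the full model and a semantics-removed variant at several observation ratios and inspect the accuracy gap on the partial ones.

    # Hypothetical scoring of the ablation described above (not from the paper).
    def accuracy_at_ratio(model, dataset, ratio, predict):
        correct = 0
        for frames, label in dataset:
            n = max(1, int(len(frames) * ratio))        # observe only the first fraction of the action
            correct += int(predict(model, frames[:n]) == label)
        return correct / len(dataset)

    def ablation_table(full_model, ablated_model, dataset, predict,
                       ratios=(0.1, 0.25, 0.5, 0.75, 1.0)):
        rows = []
        for r in ratios:
            full = accuracy_at_ratio(full_model, dataset, r, predict)
            abl = accuracy_at_ratio(ablated_model, dataset, r, predict)
            rows.append((r, full, abl, full - abl))     # a non-positive gap on partial ratios would falsify the claim
        return rows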

Figures

Figures reproduced from arXiv: 2604.27508 by Hyuno Kim, Masahiro Hirano, Yongpeng Cao, Yuji Yamakawa.

Figure 1: A comparison between the conventional holistic approach and the proposed …
Figure 2: Framework overview of the proposed method.
Figure 3: Illustration of data interpolation.
Figure 4: Visualization of attention weights in the cross-modal fusion module for four sample sequences, comparing the cross-attention outputs of the …
read the original abstract

Understanding human actions is critical for advancing behavior analysis in human-robot interaction. Particularly in tasks that demand quick and proactive feedback, robots must recognize human actions as early as possible from incomplete observations. Sub-actions offer the semantic and hierarchical cues needed for this, since human actions are inherently structured and can be decomposed into smaller, meaningful units. However, conventional approaches focus primarily on holistic actions and often overlook the rich semantic structure embedded in sub-actions, making them poorly suited for early recognition. To address this gap, we introduce SASI (Sub-Action Semantics Integrated cross-modal fusion), a novel framework that integrates existing graph convolution networks to fuse spatiotemporal features with sub-action semantics. SASI exploits a segmentation model with a traditional skeleton-based graph convolution network, capturing both fine-grained sub-action semantics and overall spatial context, while operating in real-time at 29 Hz. Experiments on BABEL, a skeleton-based dataset with frame-level annotations, demonstrate that our method improves recognition accuracy over conventional approaches, with additional gains expected as the quality of sub-action segmentation improves. Notably, SASI also achieves superior performance in understanding partial action sequences, revealing its capability for early recognition, which is essential for proactive and seamless Human-Robot Interaction (HRI). Code is available at https://anonymous.4open.science/r/SASI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SASI, a framework that fuses sub-action semantics extracted by a segmentation model with spatiotemporal features from a skeleton-based graph convolutional network for early action recognition in human-robot interaction. It claims improved accuracy over conventional approaches on the BABEL dataset (with gains expected to increase with better segmentation quality) and superior performance on partial action sequences, while operating in real time at 29 Hz. Code is provided via an anonymous link.

Significance. If the reported gains are substantiated with proper controls, the approach could meaningfully advance early recognition for proactive HRI by exploiting hierarchical sub-action structure rather than holistic actions alone. The real-time capability and open code are practical strengths for robotics applications.

major comments (2)
  1. [Abstract] Abstract: The claim that the method 'improves recognition accuracy over conventional approaches' and achieves 'superior performance in understanding partial action sequences' is presented without any numerical results, baselines, error bars, or statistical details. This is load-bearing because the abstract itself states that gains depend on sub-action segmentation quality, yet supplies no evidence to support the superiority assertion.
  2. [Experiments] Experiments section: No ablation is reported that isolates the contribution of the proposed cross-modal fusion step from the sub-action segmentation input itself (e.g., simple concatenation vs. learned attention, or GCN with vs. without semantic cues). Segmentation accuracy metrics on BABEL are also absent. This directly undermines the central claim, as any downstream GCN could appear improved if the segmentation model already supplies strong frame-level signals.
minor comments (1)
  1. [Abstract] The code link is to an anonymous repository; a permanent, non-anonymous link should be provided in the final version to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to strengthen the presentation of results and experimental validation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the method 'improves recognition accuracy over conventional approaches' and achieves 'superior performance in understanding partial action sequences' is presented without any numerical results, baselines, error bars, or statistical details. This is load-bearing because the abstract itself states that gains depend on sub-action segmentation quality, yet supplies no evidence to support the superiority assertion.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version, we will update the abstract to report specific accuracy improvements (with baselines and standard deviations) on the BABEL dataset for both full and partial sequences, while retaining the note on dependence on segmentation quality and referencing the corresponding experimental metrics. revision: yes

  2. Referee: [Experiments] Experiments section: No ablation is reported that isolates the contribution of the proposed cross-modal fusion step from the sub-action segmentation input itself (e.g., simple concatenation vs. learned attention, or GCN with vs. without semantic cues). Segmentation accuracy metrics on BABEL are also absent. This directly undermines the central claim, as any downstream GCN could appear improved if the segmentation model already supplies strong frame-level signals.

    Authors: We concur that an explicit ablation isolating the fusion mechanism is necessary to substantiate the contribution of SASI. We will add this to the experiments section, comparing the full model against ablated variants (GCN without semantics, and simple concatenation versus learned cross-modal fusion). We will also include the segmentation model's frame-level accuracy on BABEL to quantify input quality and show that downstream gains arise from the integration rather than segmentation alone. revision: yes
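For the frame-level segmentation accuracy the response proposes to report, a minimal sketch of the metric follows; the label convention (one integer sub-action id per frame, -1 for unannotated frames) is an assumption for illustration, not BABEL's actual annotation format.

    # Illustrative frame-level accuracy for sub-action segmentation.
    import numpy as np

    def frame_accuracy(pred, gt, ignore_index=-1):
        pred, gt = np.asarray(pred), np.asarray(gt)
        mask = gt != ignore_index                  # skip unannotated frames
        if mask.sum() == 0:
            return float("nan")
        return float((pred[mask] == gt[mask]).mean())

    # Example: 3 of 4 annotated frames carry the correct sub-action label
    #   frame_accuracy([0, 0, 1, 1, 2], [0, 0, 1, 2, -1])  -> 0.75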

Circularity Check

0 steps flagged

No circularity: empirical pipeline without derivation or self-referential reduction

full rationale

The paper describes SASI as an integration of off-the-shelf graph convolution networks with a sub-action segmentation model for fusing spatiotemporal features and semantics on the BABEL dataset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. Claims of improved accuracy and early recognition are presented as experimental outcomes rather than reductions by construction. The method is self-contained as a standard empirical framework whose validity rests on external benchmarks, not internal definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard assumptions of graph convolution networks and segmentation models already published elsewhere; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1008 out tokens · 41750 ms · 2026-05-07T10:24:12.975112+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Semi-Supervised Classification with Graph Convolutional Networks

    T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016

  2. [3]

    Action recognition by hierarchical mid-level action elements,

    T. Lan, Y. Zhu, A. R. Zamir, and S. Savarese, “Action recognition by hierarchical mid-level action elements,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4552–4560

  3. [4]

    Action recognition by hierarchical mid-level action elements,

    ——, “Action recognition by hierarchical mid-level action elements,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4552–4560

  4. [5]

    Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions,

    Y. Yang, I. Saleemi, and M. Shah, “Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1635–1648, 2013

  5. [6]

    Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments,

    C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013

  6. [7]

    Hierarchical recurrent neural network for skeleton based action recognition,

    Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118

  7. [8]

    Skeleton-based human action recognition with global context-aware attention lstm networks,

    J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, “Skeleton-based human action recognition with global context-aware attention lstm networks,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2017

  8. [9]

    Spatio-temporal attention-based lstm networks for 3d action recognition and detection,

    S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, “Spatio-temporal attention-based lstm networks for 3d action recognition and detection,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3459–3471, 2018

  9. [10]

    Skeleton based action recognition with convolutional neural network,

    Y. Du, Y. Fu, and L. Wang, “Skeleton based action recognition with convolutional neural network,” in Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 579–583

  10. [11]

    Spatial temporal graph convolutional networks for skeleton-based action recognition,

    S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018

  11. [12]

    Channel-wise topology refinement graph convolution for skeleton-based action recognition,

    Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, “Channel-wise topology refinement graph convolution for skeleton-based action recognition,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 13359–13368

  12. [13]

    Skeleton-based action recognition with shift graph convolutional network,

    K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, “Skeleton-based action recognition with shift graph convolutional network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 183–192

  13. [14]

    Degcn: Deformable graph convolutional networks for skeleton-based action recognition,

    W. Myung, N. Su, J.-H. Xue, and G. Wang, “Degcn: Deformable graph convolutional networks for skeleton-based action recognition,” IEEE Transactions on Image Processing, vol. 33, pp. 2477–2490, 2024

  14. [15]

    Blockgcn: Redefine topology awareness for skeleton-based action recognition,

    Y. Zhou, X. Yan, Z.-Q. Cheng, Y. Yan, Q. Dai, and X.-S. Hua, “Blockgcn: Redefine topology awareness for skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2049–2058

  15. [16]

    Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition,

    H. Liu, Y. Liu, M. Ren, H. Wang, Y. Wang, and Z. Sun, “Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition,” arXiv preprint arXiv:2411.18941, 2024

  16. [17]

    Infogcn: Representation learning for human skeleton-based action recognition,

    H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani, “Infogcn: Representation learning for human skeleton-based action recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 20186–20196

  17. [18]

    Two-stream adaptive graph convolutional networks for skeleton-based action recognition,

    L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

  18. [19]

    Temporal decoupling graph convolutional network for skeleton-based gesture recognition,

    J. Liu, X. Wang, C. Wang, Y. Gao, and M. Liu, “Temporal decoupling graph convolutional network for skeleton-based gesture recognition,” IEEE Transactions on Multimedia, vol. 26, pp. 811–823, 2023

  19. [20]

    Infogcn++: Learning representation by predicting the future for online human skeleton-based action recognition,

    S. Chi, H.-g. Chi, Q. Huang, and K. Ramani, “Infogcn++: Learning representation by predicting the future for online human skeleton-based action recognition,” arXiv preprint arXiv:2310.10547, 2023

  20. [21]

    Ntu rgb+d: A large scale dataset for 3d human activity analysis,

    A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019

  21. [22]

    Contact-aware human motion forecasting,

    W. Mao, R. I. Hartley, and M. Salzmann, “Contact-aware human motion forecasting,” Advances in Neural Information Processing Systems, vol. 35, pp. 7356–7367, 2022

  22. [23]

    Exploiting three-dimensional gaze tracking for action recognition during bimanual manipulation to enhance human-robot collaboration,

    A. Haji Fathaliyan, X. Wang, and V. J. Santos, “Exploiting three-dimensional gaze tracking for action recognition during bimanual manipulation to enhance human-robot collaboration,” Frontiers in Robotics and AI, vol. 5, p. 25, 2018

  23. [24]

    Spatiotemporal multimodal learning with 3d cnns for video action recognition,

    H. Wu, X. Ma, and Y. Li, “Spatiotemporal multimodal learning with 3d cnns for video action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1250–1261, 2021

  24. [25]

    Revisiting skeleton-based action recognition,

    H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, “Revisiting skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978

  25. [26]

    Pevl: Pose-enhanced vision-language model for fine-grained human action recognition,

    H. Zhang, M. C. Leong, L. Li, and W. Lin, “Pevl: Pose-enhanced vision-language model for fine-grained human action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18857–18867

  26. [27]

    Marker-less kendo motion prediction using high-speed dual-camera system and lstm method,

    Y. Cao and Y. Yamakawa, “Marker-less kendo motion prediction using high-speed dual-camera system and lstm method,” in 2022 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2022, pp. 159–164

  27. [28]

    The wisdom of crowds: Temporal progressive attention for early action prediction,

    A. Stergiou and D. Damen, “The wisdom of crowds: Temporal progressive attention for early action prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14709–14719

  28. [29]

    Rich action-semantic consistent knowledge for early action prediction,

    X. Liu, J. Yin, D. Guo, and H. Liu, “Rich action-semantic consistent knowledge for early action prediction,” IEEE Transactions on Image Processing, vol. 33, pp. 479–492, 2023

  29. [30]

    Multimodal human action recognition in assistive human-robot interaction,

    I. Rodomagoulakis, N. Kardaris, V. Pitsikalis, E. Mavroudi, A. Katsamanis, A. Tsiami, and P. Maragos, “Multimodal human action recognition in assistive human-robot interaction,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016, pp. 2702–2706

  30. [31]

    Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks,

    G. J. Maeda, G. Neumann, M. Ewerton, R. Lioutikov, O. Kroemer, and J. Peters, “Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks,” Autonomous Robots, vol. 41, no. 3, pp. 593–612, 2017

  31. [32]

    Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration,

    J. Bütepage, H. Kjellström, and D. Kragic, “Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration,” arXiv preprint arXiv:1702.08212, 2017

  32. [33]

    Efficient and collision-free human–robot collaboration based on intention and trajectory prediction,

    J. Lyu, P. Ruppel, N. Hendrich, S. Li, M. Görner, and J. Zhang, “Efficient and collision-free human–robot collaboration based on intention and trajectory prediction,” IEEE Transactions on Cognitive and Developmental Systems, vol. 15, no. 4, pp. 1853–1863, 2022

  33. [34]

    Interact: Transformer models for human intent prediction conditioned on robot actions,

    K. Kedia, A. Bhardwaj, P. Dan, and S. Choudhury, “Interact: Transformer models for human intent prediction conditioned on robot actions,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 621–628

  34. [35]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Z. Fu, T. Z. Zhao, and C. Finn, “Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,” arXiv preprint arXiv:2401.02117, 2024

  35. [36]

    Finegym: A hierarchical video dataset for fine-grained action understanding,

    D. Shao, Y. Zhao, B. Dai, and D. Lin, “Finegym: A hierarchical video dataset for fine-grained action understanding,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2616–2625

  36. [37]

    BABEL: Bodies, action and behavior with english labels,

    A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-Ramirez, and M. J. Black, “BABEL: Bodies, action and behavior with English labels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp. 722–731

  37. [38]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  38. [39]

    AMASS: Archive of motion capture as surface shapes,

    N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “AMASS: Archive of motion capture as surface shapes,” in International Conference on Computer Vision, Oct. 2019, pp. 5442–5451

  39. [40]

    Temos: Generating diverse human motions from textual descriptions,

    M. Petrovich, M. J. Black, and G. Varol, “Temos: Generating diverse human motions from textual descriptions,” in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 480–497

  40. [41]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019

  41. [42]

    Humantomato: Text-aligned whole-body motion generation,

    S. Lu, L.-H. Chen, A. Zeng, J. Lin, R. Zhang, L. Zhang, and H.-Y. Shum, “Humantomato: Text-aligned whole-body motion generation,” arXiv preprint arXiv:2310.12978, 2023

  42. [43]

    Skeleton mixformer: Multivariate topology representation for skeleton-based action recognition,

    W. Xin, Q. Miao, Y. Liu, R. Liu, C.-M. Pun, and C. Shi, “Skeleton mixformer: Multivariate topology representation for skeleton-based action recognition,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2211–2220

  43. [44]

    Skateformer: Skeletal-temporal transformer for human action recognition,

    J. Do and M. Kim, “Skateformer: Skeletal-temporal transformer for human action recognition,” arXiv preprint arXiv:2403.09508, 2024

  44. [45]

    Generative action description prompts for skeleton-based action recognition,

    W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, “Generative action description prompts for skeleton-based action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10276–10285

  45. [46]

    Fact: Frame-action cross-attention temporal modeling for efficient action segmentation,

    Z. Lu and E. Elhamifar, “Fact: Frame-action cross-attention temporal modeling for efficient action segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18175–18185

  46. [47]

    Multi-modality co-learning for efficient skeleton-based action recognition,

    J. Liu, C. Chen, and M. Liu, “Multi-modality co-learning for efficient skeleton-based action recognition,” in Proceedings of the 32nd ACM international conference on multimedia, 2024, pp. 4909–4918