Robust Assembly State Reasoning from Action Recognition for Human-Robot Collaboration

James Fant-Male; Roel Pieters

arxiv: 2606.20150 · v1 · pith:E6FP4KL6new · submitted 2026-06-18 · 💻 cs.RO


Pith reviewed 2026-06-26 17:12 UTC · model grok-4.3

classification 💻 cs.RO

keywords human-robot collaboration · action recognition · assembly state tracking · hidden Markov models · neural networks · logic-based reasoning · robustness evaluation


The pith

Logic-based methods track assembly states more robustly than NN or HMM approaches when human actions vary or repeat without extra sensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates five approaches to inferring the current step in a collaborative assembly task from human action recognition outputs. It runs the methods on two different datasets, feeding them both clean simulated sequences with added noise at multiple levels and real outputs from an action recognition model. Results indicate that neural networks and hidden Markov models succeed mainly when the set of possible actions stays small and consistent, whereas logic-based reasoning holds up better when actions repeat or the sequence can branch in unexpected ways. Approaches that explicitly model how long each action should last also reduce errors in repeated-action settings. Accurate state tracking matters because it lets a robot decide its next move without needing constant confirmation from the human partner.
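To make the simulated-input condition concrete, the sketch below corrupts a clean per-frame action sequence at a chosen noise level. It is an illustration only: the paper's actual noise model is not described on this page, so the substitution scheme (each frame is swapped for a different, uniformly chosen label with probability `noise_level`), the function name `add_label_noise`, and the action labels are all assumptions.

```python
import random

# Hypothetical action vocabulary, chosen only for illustration.
LABELS = ["pick", "place", "screw", "idle"]


def add_label_noise(actions, labels, noise_level, seed=0):
    """Return a copy of `actions` in which each frame is replaced by a
    different label with probability `noise_level` (assumed noise model)."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    noisy = []
    for action in actions:
        if rng.random() < noise_level:
            alternatives = [label for label in labels if label != action]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(action)
    return noisy


# A clean per-frame sequence and a corrupted copy at noise level 0.3,
# one of the levels named in the figure captions.
clean = ["pick"] * 4 + ["screw"] * 6 + ["place"] * 4
noisy = add_label_noise(clean, LABELS, noise_level=0.3)
```

Feeding the same corrupted sequence to every tracker would keep the comparison fair, which is why the sketch takes an explicit seed.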

Core claim

Optimal assembly state tracking from action recognition is not uniform: neural network and hidden Markov model methods perform adequately in tasks with limited variability, while logic-based methods remain robust across scenarios with greater variability or repeated actions; methods that incorporate expected action duration further improve reliability when no additional sensing is available to disambiguate repeats.

What carries the argument

Systematic comparison of logic-based, hidden Markov model, and neural network state trackers that consume human action recognition inputs, tested under controlled noise and realistic model outputs on two assembly datasets.

If this is right

Tasks with repeated actions require duration modeling to avoid sequence errors when sensing is limited.
Logic-based trackers are preferable for human-robot collaboration processes that allow many valid action orders.
Neural network and hidden Markov model trackers suffice only when the assembly sequence has low branching.
Performance gaps between simulated and realistic inputs highlight the need to test trackers with actual recognition model errors.
Method selection for state tracking should be matched to measured task variability rather than applied uniformly.
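The first point above can be illustrated with a small rule-based tracker that uses an expected duration per step to split a long run of one action into repeated steps. This is a minimal sketch under stated assumptions, not the paper's method: the `Step` class, the `track_states` function, the plan, and the frame counts are hypothetical, and a step is counted as complete once its expected action has been seen for `expected_frames` frames.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str           # action label expected for this step
    expected_frames: int  # assumed typical duration of the step, in frames


def track_states(observed, plan):
    """Return, for every frame, how many plan steps are judged complete."""
    done = 0   # number of completed steps (the assembly state)
    run = 0    # frames of the currently expected action seen so far
    states = []
    for action in observed:
        if done < len(plan) and action == plan[done].action:
            run += 1
            if run >= plan[done].expected_frames:
                done, run = done + 1, 0
        # Frames that do not match the expected action are ignored,
        # which tolerates isolated recognition errors.
        states.append(done)
    return states


# Hypothetical plan in which the "screw" action is repeated back to back.
PLAN = [Step("pick", 3), Step("screw", 5), Step("screw", 5), Step("place", 3)]
observed = ["pick"] * 3 + ["screw"] * 10 + ["place"] * 3
states = track_states(observed, PLAN)
```

Without the duration check, the ten consecutive "screw" frames would be indistinguishable from a single long step. The trade-off is that a worker who is much slower or faster than `expected_frames` would be miscounted, so the durations would need to be estimated from data.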

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid trackers that switch between logic and probabilistic methods based on observed variability could combine the strengths of each.
The same evaluation approach could be applied to multi-human or multi-robot assembly lines to check whether the robustness pattern holds.
Integration with robot motion planners would let the findings directly affect collision avoidance and task scheduling in physical setups.
Extending the noise models to include sensor dropouts or occlusions would test whether logic-based advantages persist under more realistic perception failures.

Load-bearing premise

The two chosen datasets plus simulated noise levels and realistic action recognition outputs together cover the variability found in actual human-robot assembly work.

What would settle it

A third assembly dataset containing higher action variability or longer repeated sequences in which the logic-based tracker records lower accuracy than the neural network or hidden Markov model trackers when fed realistic action recognition outputs.

Figures

Figures reproduced from arXiv: 2606.20150 by James Fant-Male, Roel Pieters.

**Figure 1.** State prediction F1 score with different noise levels applied to the action input for (a) HA4M and (b) IKEA datasets.

**Figure 3.** Confusion matrices for HA4M task state prediction with 0 noise added to input.

**Figure 4.** Confusion matrices for HA4M task state prediction with 0.3 noise added to input.

**Figure 5.** Confusion matrices for IKEA Lack Side Table task state prediction with 0 noise added to input.

**Figure 6.** Confusion matrices for IKEA Lack Side Table task state prediction with 0.3 noise added to input.

**Figure 7.** Confusion matrices for HA4M task state prediction with HAR input.

**Figure 8.** Confusion matrices for IKEA Lack Side Table task state prediction with HAR input.

**Figure 9.** Example state predictions from baseline method and action inputs from the ST-GCN HAR model.
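The figures report state prediction F1 scores and confusion matrices. As a reminder of how such a score can be derived from per-frame predictions, here is a small sketch that computes a macro-averaged F1. The averaging scheme is an assumption, since this page does not say which variant the paper uses, and `macro_f1` is a hypothetical helper name.

```python
def macro_f1(true_states, pred_states):
    """Unweighted mean of per-state F1 scores over all states that occur
    in either sequence (assumed averaging scheme)."""
    labels = sorted(set(true_states) | set(pred_states))
    pairs = list(zip(true_states, pred_states))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the state never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one frame of state 0 is mistaken for state 1.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

In the toy example the per-state F1 values are 2/3 and 4/5, so the macro average is 11/15.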

Original abstract

Human Action Recognition (HAR) is frequently investigated in Human-Robot Collaboration (HRC) research to understand what actions have been performed and hence the state of a collaborative task. Accurately tracking an assembly state from HAR is however not fully investigated, and in realistic scenarios is not a trivial task. This research systematically investigates and compares methods for tracking assembly state using action recognition inputs. Investigations using two diverse datasets and five state tracking approaches, including logic-based, Hidden Markov Model (HMM), and neural network (NN) methods, show that optimal approaches are not uniform across different tasks and that different methods fail under different circumstances. Testing is performed using both simulated inputs with varying noise levels and realistic inputs from a HAR model. Results show NN and HMM methods can perform well in tasks with limited variability, but for other scenarios logic-based approaches can be more robust. Methods which model expected action duration are also important for tasks with repeated actions where no additional sensing is provided.
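For readers unfamiliar with the HMM family of trackers named in the abstract, the following sketch decodes the most likely sequence of assembly states from recognised action labels using the Viterbi algorithm. It is a generic textbook formulation, not the authors' model: the three-state left-to-right structure and every probability below are made-up values for illustration.

```python
import math


def _log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden state path for a sequence of observation indices."""
    n = len(start_p)
    score = [_log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + _log(trans_p[r][s]))
            ptr.append(best)
            score.append(prev[best] + _log(trans_p[best][s]) + _log(emit_p[s][o]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # follow the back-pointers to the start
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical model: state i is "step i in progress" and action i is its
# typical action. Transitions only move forward through the assembly.
START = [1.0, 0.0, 0.0]
TRANS = [[0.8, 0.2, 0.0],
         [0.0, 0.8, 0.2],
         [0.0, 0.0, 1.0]]
EMIT = [[0.8, 0.1, 0.1],
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8]]

# Recognised actions with one spurious "action 0" frame during step 1.
observations = [0, 0, 1, 1, 0, 1, 2, 2]
path = viterbi(observations, START, TRANS, EMIT)
```

Because backward transitions have zero probability here, the decoded path cannot return to an earlier step, so the spurious frame is absorbed as an emission error rather than a state change. The same restriction would be a weakness in tasks where the order of steps can vary, which is consistent with the abstract's observation that HMM methods suit tasks with limited variability.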

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows that state tracking from HAR works best when the method matches task traits like low variability or repeated actions, based on tests across two datasets.


The main takeaway is that no single approach wins for turning action recognition into reliable assembly state estimates. NN and HMM methods handle tasks with limited variation, logic-based ones hold up better in other cases, and modeling expected action durations matters when the same action repeats without extra sensors.

The paper runs a head-to-head comparison of five methods on two datasets. It tests both simulated inputs at varying noise levels and outputs from a real HAR model. This produces clear evidence that different methods fail under different conditions, which extends earlier HAR work into the state-tracking question.

The empirical setup is straightforward and the results back the task-dependent claim without obvious circularity. That part is done cleanly and gives people in HRC a practical basis for choosing trackers.

The soft spot is scope. The findings rest on exactly two datasets and simulated noise rather than measured HAR error distributions from the field. If those tasks and noise patterns do not match broader assembly variability, including partial observability or inter-task links, the advice that optimal methods are not uniform stays tied to this testbed.

This is for researchers building or evaluating human-robot collaboration systems who need guidance on state tracking choices. Readers in that subfield will get usable pointers from the comparison.

It shows clear thinking on the experimental design and engages the literature on its own terms. I would send it to peer review so referees can check whether the robustness conclusions hold beyond the two datasets.

Referee Report

1 major / 1 minor

Summary. The paper claims that a systematic comparison of five assembly state tracking methods (logic-based, HMM, and NN variants) on two diverse datasets, using both simulated inputs at varying noise levels and realistic outputs from a HAR model, demonstrates that optimal methods are not uniform across tasks: NN and HMM perform well in low-variability settings while logic-based methods are more robust elsewhere, and that modeling expected action durations is important for repeated actions without additional sensing.

Significance. If the comparative results hold, the work provides actionable guidance for method selection in HRC assembly state reasoning by identifying task-dependent failure modes and the value of duration modeling. Strengths include the use of both simulated and realistic HAR inputs plus multiple method classes, which allows direct head-to-head evaluation rather than isolated testing.

major comments (1)

[Abstract] The headline claim that 'for other scenarios logic-based approaches can be more robust' and that duration modeling 'is also important' generalizes from experiments on exactly two datasets with simulated noise; the manuscript does not demonstrate that the chosen tasks or noise model span the relevant real-world failure modes (sensor noise distributions, action duration variance, partial observability) needed to support the non-uniform optimality conclusion beyond the tested cases.

minor comments (1)

[Abstract] Results are summarized at a high level without any quantitative metrics, error bars, or concrete performance numbers, which makes it hard to gauge effect sizes from the abstract alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive criticism of the abstract. We address the major comment below by agreeing to revise the abstract's wording to avoid overgeneralization while preserving the core empirical findings from the two datasets.


Referee: [Abstract] The headline claim that 'for other scenarios logic-based approaches can be more robust' and that duration modeling 'is also important' generalizes from experiments on exactly two datasets with simulated noise; the manuscript does not demonstrate that the chosen tasks or noise model span the relevant real-world failure modes (sensor noise distributions, action duration variance, partial observability) needed to support the non-uniform optimality conclusion beyond the tested cases.

Authors: We agree that the abstract's phrasing implies broader applicability than the experiments directly support. The two datasets were selected for diversity in assembly tasks and variability, with testing under both controlled simulated noise and realistic HAR outputs, but we acknowledge these do not exhaustively cover all sensor distributions, duration variances, or partial observability cases. We will revise the abstract to state that results demonstrate task-dependent robustness and the value of duration modeling within the evaluated scenarios and noise models, without claiming non-uniform optimality beyond the tested cases.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method comparison on external datasets


The paper reports experimental results from applying five state-tracking methods (logic-based, HMM, NN) to two diverse datasets using both simulated noisy inputs and realistic HAR outputs. Conclusions about relative robustness, suitability for low-variability tasks, and importance of duration modeling follow directly from those measured performance differences. No equations, fitted parameters renamed as predictions, self-definitional relations, or load-bearing self-citations appear in the derivation chain. The work is self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This abstract-only review identifies no explicit free parameters, axioms, or invented entities; the abstract gives no details on the methods or data.

pith-pipeline@v0.9.1-grok · 5693 in / 1012 out tokens · 24472 ms · 2026-06-26T17:12:46.622492+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references

[1] X. Xu, Y. Lu, B. Vogel-Heuser, and L. Wang, “Industry 4.0 and Industry 5.0—Inception, conception and perception,” Journal of Manufacturing Systems, vol. 61, pp. 530–535, 2021.

[2] J. Fant-Male and R. Pieters, “A review of personalisation in human-robot collaboration and future perspectives towards industry 5.0,” in 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2025, pp. 223–230.

[3] K. P. Hawkins, N. Vo, S. Bansal, and A. F. Bobick, “Probabilistic human action prediction and wait-sensitive planning for responsive human-robot collaboration,” in 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE, 2013, pp. 499–506.

[4] A. M. Zanchettin, A. Casalino, L. Piroddi, and P. Rocco, “Prediction of human activity patterns for human–robot collaborative assembly tasks,” IEEE Transactions on Industrial Informatics, vol. 15, no. 7, pp. 3934–3942, 2018.

[5] J. Qu, Y. Li, C. Liu, W. Wang, and W. Fu, “Prediction of Assembly Intent for Human-Robot Collaboration Based on Video Analytics and Hidden Markov Model,” Computers, Materials, & Continua, vol. 84, no. 2, p. 3787, 2025.

[6] S. K. Dwivedi, H. Nagayoshi, and H. Ohashi, “Prediction of high-level actions from the sequence of atomic actions in assembly line workstations,” in IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA). IEEE, 2024, pp. 1–8.

[7] J. Zhang, P. Wang, and R. X. Gao, “Hybrid machine learning for human action recognition and prediction in assembly,” Robotics and Computer-Integrated Manufacturing, vol. 72, p. 102184, 2021.

[8] J. Xiao, B. Wang, K. Huang, S. Terzi, W. Wang, and M. Macchi, “Intelligent disassembly scenario understanding for human behavior and intention recognition towards self-perception human-robot collaboration system,” Journal of Manufacturing Systems, vol. 83, pp. 937–962, 2025.

[9] D. Moutinho, L. F. Rocha, C. M. Costa, L. F. Teixeira, and G. Veiga, “Deep learning-based human action recognition to leverage context awareness in collaborative assembly,” Robotics and Computer-Integrated Manufacturing, vol. 80, p. 102449, 2023.

[10] R. Zhang, J. Li, P. Zheng, Y. Lu, J. Bao, and X. Sun, “A fusion-based spiking neural network approach for predicting collaboration request in human-robot collaboration,” Robotics and Computer-Integrated Manufacturing, vol. 78, p. 102383, 2022.

[11] J. Male and U. Martinez-Hernandez, “Deep learning based robot cognitive architecture for collaborative assembly tasks,” Robotics and Computer-Integrated Manufacturing, vol. 83, p. 102572, 2023.

[12] C. Gkournelos, C. Konstantinou, P. Angelakis, E. Tzavara, and S. Makris, “Praxis: A framework for AI-driven human action recognition in assembly,” Journal of Intelligent Manufacturing, vol. 35, no. 8, pp. 3697–3711, 2024.

[13] V. Selvaraj, M. Al-Amin, W. Tao, and S. Min, “Intelligent assembly operations monitoring with the ability to detect non-value-added activities as out-of-distribution (OOD) instances,” CIRP Annals, vol. 72, no. 1, pp. 413–416, 2023.

[14] V. Selvaraj, M. Al-Amin, X. Yu, W. Tao, and S. Min, “Real-time action localization of manual assembly operations using deep learning and augmented inference state machines,” Journal of Manufacturing Systems, vol. 72, pp. 504–518, 2024.

[15] G. Cicirelli, R. Marani, L. Romeo, M. G. Domínguez, J. Heras, A. G. Perri, and T. D’Orazio, “The HA4M dataset: Multi-Modal Monitoring of an assembly task for Human Action recognition in Manufacturing,” Scientific Data, vol. 9, no. 1, p. 745, 2022.

[16] Y. Ben-Shabat, X. Yu, F. Saleh, D. Campbell, C. Rodriguez-Opazo, H. Li, and S. Gould, “The IKEA ASM Dataset: Understanding People Assembling Furniture Through Actions, Objects and Pose,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 847–859.

[17] M. Soleymani, M. Bonyani, and C. Wang, “Skeleton-based action recognition for manufacturing assembly task through graph convolution network,” Journal of Manufacturing Systems, vol. 82, pp. 362–375, 2025.

[18] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
