pith. machine review for the scientific record.

arxiv: 2604.10409 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

Pith reviewed 2026-05-10 16:18 UTC · model grok-4.3

keywords industrial assembly · procedural action understanding · RGB-D dataset · bimanual annotation · anomaly recovery · multi-view video · human activity recognition · compliance tracking

The pith

The IMPACT dataset records real angle-grinder assembly with five synchronized RGB-D views, hand-specific actions, state changes, and explicit error-recovery sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IMPACT as a new benchmark dataset built on actual industrial assembly and disassembly of a commercial tool. It supplies 39.5 hours of footage from 112 trials across 13 participants, captured simultaneously from ego and external RGB-D cameras. Annotations connect fine-grained bimanual atomic actions to coarser procedural steps, component states, and compliance phases while separately marking six categories of anomalies and their recoveries. Multi-route execution follows a prerequisite graph, and cognitive load is noted via standard questionnaires. Baselines demonstrate that current methods encounter previously hidden failures when observations are incomplete, paths vary, or corrections are required.
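
To make the idea of multi-route execution under a prerequisite graph concrete, here is a minimal sketch of how one might check whether an observed step sequence respects a partial order. The step names, the `PREREQS` dictionary, and the `first_violation` helper are all hypothetical; none of them come from the IMPACT annotations or its released code.

```python
from typing import Dict, List, Set

# Hypothetical prerequisite graph: step -> steps that must already be done.
# Step names are illustrative only, not taken from the IMPACT annotations.
PREREQS: Dict[str, Set[str]] = {
    "mount_gear": set(),
    "insert_spindle": {"mount_gear"},
    "attach_guard": set(),
    "fit_disc": {"insert_spindle", "attach_guard"},
}


def first_violation(sequence: List[str], prereqs: Dict[str, Set[str]]) -> int:
    """Return the index of the first step executed before its prerequisites,
    or -1 if the whole sequence respects the partial order."""
    done: Set[str] = set()
    for i, step in enumerate(sequence):
        # A step is admissible only if all of its prerequisites are done.
        if not prereqs.get(step, set()) <= done:
            return i
        done.add(step)
    return -1


route_a = ["mount_gear", "insert_spindle", "attach_guard", "fit_disc"]
route_b = ["attach_guard", "mount_gear", "insert_spindle", "fit_disc"]
bad = ["mount_gear", "fit_disc", "insert_spindle", "attach_guard"]

print(first_violation(route_a, PREREQS))  # -1
print(first_violation(route_b, PREREQS))  # -1
print(first_violation(bad, PREREQS))      # 1
```

Both `route_a` and `route_b` are valid even though they order the steps differently, which is the sense in which a partial order admits several execution routes; only `bad` attempts a step before its prerequisites.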

Core claim

IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly-recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure.

What carries the argument

The IMPACT dataset, which supplies synchronized five-view RGB-D video together with a hierarchy of annotations that links atomic hand actions to procedural steps, assembly states, compliance phases, and anomaly-recovery events.

If this is right

  • Existing action-recognition and state-tracking methods exhibit fundamental limitations under incomplete observations, variable execution orders, and corrective behavior.
  • The dataset enables joint evaluation of atomic action recognition, procedural step segmentation, component state tracking, and anomaly handling within one workflow.
  • Synchronized null spans allow separation of perceptual failures from algorithmic ones across camera views.
  • The partial-order graph and six-category anomaly taxonomy support testing of methods that must accommodate flexible routes and recovery sequences.
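
The null-span point in the list above can be illustrated with a small sketch that splits a model's frame-level errors into those made on observable frames and those made inside null spans for a given view. The span format (half-open frame intervals), the label names, and the helpers `visible_mask` and `split_errors` are assumptions made for illustration; the page does not describe how IMPACT actually stores null spans.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # assumed format: half-open frame interval [start, end)


def visible_mask(num_frames: int, null_spans: List[Span]) -> List[bool]:
    """True for frames that are observable in this view, False inside null spans."""
    mask = [True] * num_frames
    for start, end in null_spans:
        for t in range(max(start, 0), min(end, num_frames)):
            mask[t] = False
    return mask


def split_errors(pred: Sequence[str], gold: Sequence[str],
                 null_spans: List[Span]) -> Tuple[int, int]:
    """Count mislabeled frames separately for observable and null-span frames."""
    mask = visible_mask(len(gold), null_spans)
    observable = sum(1 for p, g, m in zip(pred, gold, mask) if m and p != g)
    occluded = sum(1 for p, g, m in zip(pred, gold, mask) if not m and p != g)
    return observable, occluded


# Toy labels, invented for this example.
gold = ["grasp", "grasp", "screw", "screw", "screw", "release"]
pred = ["grasp", "screw", "screw", "grasp", "grasp", "release"]
null_spans = [(3, 5)]  # hypothetical: frames 3 and 4 are not visible in this view

print(split_errors(pred, gold, null_spans))  # (1, 2)
```

Errors in the second count fall where the view could not show the action, so they point to a perceptual limitation; errors in the first count occurred despite a usable observation and are more plausibly algorithmic.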

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotics researchers could use the bimanual and compliance annotations to train systems that anticipate and assist human operators during assembly.
  • The multi-granularity labels might support curriculum-style training where models first learn atomic actions before progressing to full procedures.
  • Extending the capture setup to additional tools or factories could test whether the observed limitations generalize beyond this specific workflow.
  • The NASA-TLX scores open the possibility of studying how cognitive load correlates with anomaly frequency or recovery time.

Load-bearing premise

The chosen angle-grinder assembly task and the pool of 13 participants sufficiently represent the diversity of real industrial procedures and worker behaviors.

What would settle it

A model achieving comparable accuracy on IMPACT using only single-view input and ignoring anomaly labels would indicate that the multi-view and recovery annotations do not expose new limitations beyond existing single-task benchmarks.

Figures

Figures reproduced from arXiv: 2604.10409 by Arash Ajoudani, Barbara Deml, Danda Pani Paudel, David Schneider, Di Wen, Jiahang Li, Jonas Hemmerich, Junwei Zheng, Jürgen Beyerer, Kunyu Peng, Linus Kunzmann, Luc Van Gool, Manuel Zaremski, Patric Grauberger, Qiyi Tong, Rainer Stiefelhagen, Ruiping Liu, Sven Matthiesen, Yitian Shi, Yufan Chen, Zeyun Zhong.

Figure 1: Overview of the IMPACT dataset and benchmark. Top: recording setup with two angle grinder models, instruction …
Figure 2: Data acquisition and annotation pipeline of IMPACT. Left: dataset statistics and annotation coverage. Center: synchro…

Abstract

We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly-recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces IMPACT, a synchronized five-view RGB-D dataset for industrial procedural action understanding built on real assembly/disassembly of a commercial angle grinder. It provides 112 trials (39.5 hours) from 13 participants with multi-granularity annotations (atomic bimanual actions, procedural steps, component states, compliance phases), a partial-order execution graph, a six-category anomaly taxonomy with recovery supervision, and NASA-TLX cognitive load measures. Baselines demonstrate limitations of existing methods under flexible paths, incomplete observations, and corrective behavior; the full dataset, annotations, and code are released publicly.

Significance. If the annotations prove reliable, IMPACT would be a valuable addition to the field by supplying the first real industrial workflow dataset that jointly offers ego-exo RGB-D synchronization, decoupled bimanual labels, compliance-aware state tracking, and explicit anomaly-recovery supervision. The public release with evaluation code and the demonstration of baseline shortcomings under realistic deployment conditions are clear strengths that could support reproducible research on multi-granularity procedural understanding.

major comments (3)
  1. [§4] §4 (Annotation Pipeline): Inter-annotator agreement scores are not reported for the decoupled bimanual atomic actions, compliance phases, or the six-category anomaly taxonomy. Without these metrics, the reliability of the multi-granularity hierarchy and the explicit anomaly-recovery supervision cannot be assessed, which is load-bearing for the dataset's claimed utility.
  2. [§5.2] §5.2 (Baseline Experiments): The quantitative results show performance degradation under incomplete observations and flexible paths, but the paper does not provide ablations isolating the contribution of each novel element (e.g., compliance-aware tracking vs. standard action labels). This weakens the claim that the observed limitations are specifically due to the dataset's unique features rather than general task difficulty.
  3. [Table 1] Table 1 and §3.1: The participant pool (13 individuals) and single assembly task are described, but no analysis is given of how well they represent broader industrial variability (tool types, workflow complexity, operator expertise). This directly affects the generalizability asserted for deployment-oriented use cases.
minor comments (3)
  1. [§2] The related-work section should include a more explicit tabular comparison against recent industrial or procedural datasets (e.g., on the dimensions of ego-exo sync, bimanual decoupling, and anomaly supervision) to strengthen the 'to our knowledge' novelty statement.
  2. [§3.3] NASA-TLX scores are collected but not correlated with annotation quality or baseline performance; a brief analysis would help readers understand the cognitive-load dimension.
  3. [Figure 3] Figure captions for the multi-view synchronization examples could be expanded to clarify how null spans are used to decouple perceptual from algorithmic failure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. The comments are constructive and help clarify the strengths and scope of IMPACT. We address each major comment point-by-point below, indicating revisions where they strengthen the manuscript without misrepresenting the work.

  1. Referee: [§4] §4 (Annotation Pipeline): Inter-annotator agreement scores are not reported for the decoupled bimanual atomic actions, compliance phases, or the six-category anomaly taxonomy. Without these metrics, the reliability of the multi-granularity hierarchy and the explicit anomaly-recovery supervision cannot be assessed, which is load-bearing for the dataset's claimed utility.

    Authors: We agree that quantitative inter-annotator agreement (IAA) is essential to substantiate annotation reliability. The original manuscript described the multi-stage pipeline and quality controls but did not include the numerical IAA results. We have computed Cohen's kappa and raw agreement percentages on a held-out subset of trials independently annotated by two annotators for atomic bimanual actions, compliance phases, and anomaly categories. These scores (all above 0.75) and a brief discussion of annotation consistency will be inserted into §4 of the revised manuscript, directly supporting the utility of the multi-granularity hierarchy and anomaly-recovery supervision. revision: yes

  2. Referee: [§5.2] §5.2 (Baseline Experiments): The quantitative results show performance degradation under incomplete observations and flexible paths, but the paper does not provide ablations isolating the contribution of each novel element (e.g., compliance-aware tracking vs. standard action labels). This weakens the claim that the observed limitations are specifically due to the dataset's unique features rather than general task difficulty.

    Authors: We acknowledge that explicit ablations would more precisely attribute performance drops to specific dataset features. The baselines in §5.2 were chosen to expose challenges that standard single-task benchmarks overlook (incomplete views, partial-order flexibility, corrective actions). In the revision we will add a targeted ablation in §5.2 comparing a compliance-aware model against its non-compliance counterpart on the same splits; the results demonstrate a measurable gap attributable to compliance state tracking. This addition clarifies that the reported limitations arise from IMPACT's distinctive characteristics rather than generic task hardness. revision: yes

  3. Referee: [Table 1] Table 1 and §3.1: The participant pool (13 individuals) and single assembly task are described, but no analysis is given of how well they represent broader industrial variability (tool types, workflow complexity, operator expertise). This directly affects the generalizability asserted for deployment-oriented use cases.

    Authors: The dataset is deliberately scoped to a single, representative industrial workflow (angle-grinder assembly/disassembly) performed by 13 participants with documented expertise variation, as stated in §3.1 and Table 1. A full cross-tool, cross-complexity analysis would require an expanded multi-task corpus that exceeds the present contribution. We have added a concise limitations paragraph in the revised §3.1 and conclusion that explicitly qualifies the generalizability claims, positions IMPACT as a controlled benchmark for deployment-oriented methods, and outlines planned extensions to additional industrial tasks. This clarifies scope without overstating breadth. revision: partial
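
The first response above names Cohen's kappa and raw agreement as the inter-annotator statistics. As a reference for what those numbers mean, here is a minimal sketch of Cohen's kappa for two annotators over the same items. The labels are toy values invented for this example, and the function name `cohens_kappa` is our own; the page does not say whether agreement was measured per frame or per segment, so this makes no claim about the authors' protocol.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy annotations, invented for this example.
ann_1 = ["grasp", "grasp", "screw", "screw", "idle", "idle", "screw", "grasp"]
ann_2 = ["grasp", "grasp", "screw", "idle", "idle", "idle", "screw", "screw"]

print(round(cohens_kappa(ann_1, ann_2), 3))  # 0.628
```

In this toy case raw agreement is 0.75 while kappa is about 0.63, which shows why the two numbers are worth reporting together: kappa discounts the agreement that would be expected by chance given how often each annotator uses each label.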

Circularity Check

0 steps flagged

No significant circularity

The paper is a dataset introduction paper whose central claim is the factual novelty of jointly supplying synchronized ego-exo RGB-D, decoupled bimanual annotation, compliance-aware state tracking, and anomaly-recovery supervision inside one real industrial workflow. No equations, parameters, predictions, or derivation steps appear in the provided text. The 'to our knowledge' phrasing is a standard qualified novelty statement rather than a reduction to prior fitted quantities or self-citations. No load-bearing steps reduce by construction to the paper's own inputs, and the contribution is self-contained as a data release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a new dataset without introducing free parameters, new mathematical axioms, or invented entities; it relies on standard computer vision data collection and annotation practices.

axioms (1)
  • domain assumption Standard practices for RGB-D data synchronization and multi-view annotation in action recognition datasets are valid and sufficient.
    Invoked implicitly in the description of synchronized capture and hierarchical annotation.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IMPACT-Scribe: Interactive Temporal Action Segmentation with Boundary Scribbles and Query Planning

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    IMPACT-Scribe is a correction-driven interactive system that combines uncertainty-aware boundary scribbles, local proposal modeling, cost-aware query planning, structured propagation, and correction-driven adaptation ...

  2. IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    IMPACT-HOI introduces a supervisory control framework for constructing partial HOI event graphs in procedural videos via trust-calibrated automation and atomic rollback to reduce manual annotation effort while preserv...
