pith. machine review for the scientific record.

arxiv: 2605.01666 · v1 · submitted 2026-05-03 · 💻 cs.CV · cs.AI · cs.RO

Recognition: unknown

IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords mixed-initiative annotation · human-object interaction · event graph construction · egocentric video · supervisory control · trust calibration · atomic rollback · robot learning from demonstration

The pith

A supervisory control framework for annotating human-object interactions reduces manual actions by 13.5 percent while recording zero confirmed-field violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IMPACT-HOI as a mixed-initiative system that treats annotation of egocentric procedural videos as the step-by-step completion of onset-anchored partial event graphs for human-object interactions. It introduces a trust-calibrated controller that picks between direct human queries, confirmed suggestions, and conservative automated completions, together with a risk-bounded protocol that uses atomic rollback to protect earlier human decisions. The work is motivated by the need for clean structured supervision to train robots on manipulation tasks from human demonstrations. A controlled study with nine annotators measured a 13.5 percent drop in manual steps and a 46.67 percent event match rate with no violations of confirmed fields. The result matters because lower-effort, high-integrity annotation pipelines could make large-scale robot learning datasets more practical to build.

Core claim

IMPACT-HOI frames the task as incremental resolution of a partially specified, onset-anchored event state. A trust-calibrated controller selects among direct queries, human-confirmed suggestions, and conservative completions on the basis of empirical annotator behavior and evidence quality. A risk-bounded execution protocol that employs atomic rollback ensures human-confirmed decisions remain safe from later automated conflicts. Under this protocol a user study with nine participants recorded a 13.5 percent reduction in manual annotation actions, a 46.67 percent event match rate, and zero confirmed-field violations.
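
To make the selection step concrete, here is a minimal sketch in Python of how such a trust-calibrated decision rule could look. The thresholds, the annotator accept-rate signal, and every name below are illustrative assumptions; the abstract does not spell out the actual calibration rule.

```python
from enum import Enum, auto

class Action(Enum):
    DIRECT_QUERY = auto()             # ask the annotator outright
    CONFIRMED_SUGGESTION = auto()     # propose a value, require confirmation
    CONSERVATIVE_COMPLETION = auto()  # fill the field automatically

def select_action(evidence_confidence: float,
                  annotator_accept_rate: float,
                  auto_threshold: float = 0.9,
                  suggest_threshold: float = 0.6) -> Action:
    """Pick a supervisory action for one open field of a partial event graph.

    Both thresholds are hypothetical; the paper calibrates its controller
    from observed annotator behavior and evidence quality, by a rule the
    abstract does not specify.
    """
    # Automate only when model evidence is strong and this annotator has
    # historically accepted suggestions at a comparably high rate.
    if evidence_confidence >= auto_threshold and annotator_accept_rate >= auto_threshold:
        return Action.CONSERVATIVE_COMPLETION
    # Medium confidence: surface a suggestion but keep the human in the loop.
    if evidence_confidence >= suggest_threshold:
        return Action.CONFIRMED_SUGGESTION
    # Low confidence: fall back to a direct query, costing one manual action.
    return Action.DIRECT_QUERY
```

Under a rule of this shape, only the first branch removes a manual action outright; the middle branch still costs a confirmation, which would be consistent with a moderate rather than dramatic reduction in manual work.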

What carries the argument

The trust-calibrated controller combined with the risk-bounded execution protocol and atomic rollback, which together supervise the incremental construction of onset-anchored partial HOI event graphs.
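
The safety half of that machinery lends itself to a small illustration. Below is a minimal sketch of lock-aware atomic rollback, assuming a dictionary-based event state and a set of locked (human-confirmed) fields; the names and structure are hypothetical, not the paper's implementation.

```python
from copy import deepcopy

class ConfirmedFieldViolation(Exception):
    """An automated update attempted to touch a human-confirmed field."""

def apply_automated_update(state: dict, locked: set, update: dict) -> dict:
    """Apply a batch of automated field completions atomically.

    If any field in `update` is locked (human-confirmed), the entire batch
    is rolled back and the prior state is restored unchanged, so no partial
    write can ever overwrite a human decision.
    """
    snapshot = deepcopy(state)  # rollback point taken before any write
    try:
        for field, value in update.items():
            if field in locked:
                raise ConfirmedFieldViolation(field)
            state[field] = value
    except ConfirmedFieldViolation:
        state.clear()
        state.update(snapshot)  # atomic rollback: restore the snapshot
    return state
```

In this toy version the zero-violation property holds by construction: any batch that conflicts with a locked field leaves the state exactly as it was before the batch began.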

If this is right

  • Structured HOI event graphs can be produced with measurably less manual work while keeping all human-confirmed fields intact.
  • Automated suggestions integrate into the annotation process without overwriting prior human decisions.
  • Onset-anchored incremental construction becomes feasible for procedural videos intended as robot demonstration data.
  • The protocol bounds risk so that the final event graphs remain suitable for downstream robot manipulation learning.
  • Annotation throughput improves under the tested trust-calibration and rollback rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the controller generalizes, the cost of creating large-scale imitation-learning datasets for robotics could fall substantially.
  • Atomic rollback offers a reusable pattern for any mixed-initiative system that must protect confirmed user inputs from later model updates.
  • The same onset-anchored framing might apply to annotation of other temporal interaction structures beyond HOI.
  • Pairing the controller with stronger vision models could raise the event match rate without increasing violation risk.

Load-bearing premise

The trust-calibrated controller and risk-bounded execution protocol with atomic rollback will generalize beyond the studied annotators and video domains without introducing undetected errors in the constructed event graphs.

What would settle it

A replication study with new annotators or in different egocentric video domains would settle it: one or more confirmed-field violations, or a failure to reduce manual annotation actions, would falsify the generalization premise behind the reported safety and efficiency claims.

Figures

Figures reproduced from arXiv: 2605.01666 by Alexander Jaus, David Schneider, Di Wen, Haoshen Zhang, Jiale Wei, Junwei Zheng, Kunyu Peng, Lei Qi, Rainer Stiefelhagen, Ruiping Liu, Yuanhao Luo, Yufan Chen, Yufeng Zhang, Zdravko Marinov, Zeyun Zhong.

Figure 1. IMPACT-HOI starts with a partial event state and video evidence. The Lock-aware Partial Event Completion (LPEC) module resolves open …
Figure 2. Main user study results of IMPACT-HOI. (a) shows efficiency and final event quality. (b) shows supervisory behavior and safety outcomes. (c) …
Original abstract

We present IMPACT-HOI, a mixed-initiative framework for annotating egocentric procedural video by constructing structured event graphs for Human-Object Interactions (HOI), motivated by the need for high-quality structured supervision for learning robot manipulation from human demonstration. IMPACT-HOI frames this task as the incremental resolution of a partially specified, onset-anchored event state. A trust-calibrated controller selects among direct queries, human-confirmed suggestions, and conservative completions based on empirical annotator behavior and evidence quality. A risk-bounded execution protocol, utilizing atomic rollback, ensures that human-confirmed decisions are preserved against conflicting automated updates. A user study with 9 participants shows a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed-field violations under the studied protocol. The code will be made publicly available at https://github.com/541741106/IMPACT_HOI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents IMPACT-HOI, a mixed-initiative framework for annotating egocentric procedural videos via incremental construction of onset-anchored partial HOI event graphs. It introduces a trust-calibrated controller that selects among direct queries, human-confirmed suggestions, and conservative completions, paired with a risk-bounded execution protocol using atomic rollback to preserve human decisions. Evaluation consists of a user study with 9 participants reporting a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed-field violations; a public code release is promised.

Significance. If the efficiency gains and safety properties hold under broader conditions, the framework could meaningfully lower the cost of producing structured HOI supervision for robot learning from demonstration. The planned public code release supports reproducibility. However, the small participant count and limited domain coverage in the reported study constrain the strength of any generalization claims.

major comments (2)
  1. [User study results] User study evaluation: the abstract and corresponding results section report a 13.5% reduction in manual actions and 46.67% event match rate from 9 participants, yet provide no details on study design (videos selected, task instructions, baseline condition), statistical tests, inter-annotator agreement, or the precise definition and computation of the match rate. These omissions are load-bearing because the central empirical claim rests entirely on these metrics.
  2. [Framework and evaluation] Generalization of the trust-calibrated controller and atomic-rollback protocol: the framework is tuned to observed annotator behavior in the studied videos, but no cross-annotator variance, cross-domain testing, or sensitivity analysis is presented. With N=9 and no reported error bars or failure-mode analysis, the zero confirmed-field violations result cannot be assessed for stability beyond the specific protocol and participants.
minor comments (1)
  1. [Abstract] The abstract states that code will be released at a GitHub link, but the manuscript does not include a reproducibility checklist or data-release statement; adding these would strengthen the submission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the user study presentation and the generalization of the framework. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims while acknowledging the inherent limitations of the preliminary study.

Point-by-point responses
  1. Referee: User study evaluation: the abstract and corresponding results section report a 13.5% reduction in manual actions and 46.67% event match rate from 9 participants, yet provide no details on study design (videos selected, task instructions, baseline condition), statistical tests, inter-annotator agreement, or the precise definition and computation of the match rate. These omissions are load-bearing because the central empirical claim rests entirely on these metrics.

    Authors: We agree that the current manuscript omits critical details on the user study protocol. In the revised version, we will expand the User Study section (and update the abstract if space permits) to include: the specific 9 egocentric procedural videos selected from the source dataset along with selection criteria; the exact task instructions and interface provided to participants; the baseline condition (purely manual annotation without any automation); the statistical tests applied (paired t-tests with p-values and effect sizes for the 13.5% reduction); inter-annotator agreement where relevant; and the precise definition of the event match rate, computed as the fraction of events for which the constructed onset-anchored partial graph exactly matches ground-truth labels on action, object, and temporal onset fields (a computation sketch of this definition follows these responses). We will also add error bars and failure-mode breakdowns. These additions directly address the load-bearing nature of the metrics. revision: yes

  2. Referee: Generalization of the trust-calibrated controller and atomic-rollback protocol: the framework is tuned to observed annotator behavior in the studied videos, but no cross-annotator variance, cross-domain testing, or sensitivity analysis is presented. With N=9 and no reported error bars or failure-mode analysis, the zero confirmed-field violations result cannot be assessed for stability beyond the specific protocol and participants.

    Authors: We acknowledge that the small participant count (N=9) and single-domain focus limit strong generalization claims, and that the zero confirmed-field violations must be interpreted within the studied protocol. The trust thresholds were derived from pilot observations of annotator behavior, but the manuscript does not report variance or sensitivity. In revision we will add: (i) per-participant variance and error bars on all reported metrics, (ii) a sensitivity analysis on the trust-calibration parameters using the collected data, (iii) explicit discussion of failure modes observed during the study, and (iv) a strengthened limitations paragraph stating that cross-domain validation and larger-scale testing remain future work. The atomic-rollback mechanism itself is protocol-independent by design: it guarantees preservation of any human-confirmed decision regardless of subsequent automated suggestions. We will clarify this distinction and avoid overclaiming stability. revision: partial
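
Response 1 above pins down the event match rate as exact agreement on action, object, and temporal onset fields. A minimal sketch of that computation follows, under the assumptions of greedy one-to-one matching and an optional onset tolerance, neither of which the paper confirms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HOIEvent:
    action: str   # e.g. "pick up"
    obj: str      # e.g. "screwdriver"
    onset: float  # onset timestamp in seconds

def event_match_rate(predicted: list[HOIEvent],
                     ground_truth: list[HOIEvent],
                     onset_tolerance: float = 0.0) -> float:
    """Fraction of ground-truth events matched one-to-one on action, object,
    and onset (within `onset_tolerance` seconds; 0.0 means exact match).

    The event fields and the tolerance parameter are illustrative
    assumptions; the paper may define matching differently.
    """
    matched = 0
    remaining = list(predicted)
    for gt in ground_truth:
        for i, pred in enumerate(remaining):
            if (pred.action == gt.action and pred.obj == gt.obj
                    and abs(pred.onset - gt.onset) <= onset_tolerance):
                matched += 1
                del remaining[i]  # consume: one prediction per event
                break
    return matched / len(ground_truth) if ground_truth else 0.0
```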

Circularity Check

0 steps flagged

No circularity detected in derivation or evaluation chain

Full rationale

The paper describes a mixed-initiative annotation framework (IMPACT-HOI) for constructing onset-anchored partial HOI event graphs and evaluates it through an external user study with 9 participants reporting observed metrics (13.5% action reduction, 46.67% event match rate, zero violations). No equations, fitted parameters, or mathematical derivations are present that could reduce to inputs by construction. The trust-calibrated controller and risk-bounded protocol are presented as design choices motivated by annotator behavior observed in the study itself, with no self-referential predictions or self-citation chains invoked to justify core claims. The evaluation rests on direct empirical observation rather than internal model outputs, so the derivation-and-evaluation chain is grounded in external measurement rather than self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework itself is presented as a new contribution without detailing underlying mathematical assumptions or fitted constants.

pith-pipeline@v0.9.0 · 5518 in / 1190 out tokens · 49632 ms · 2026-05-10T16:22:53.583298+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    Visuomotor policy learning via action diffusion, September 4 2025

    Cheng Chi, Siyuan Feng, Zhenjia Xu, Eric A Cousineau, Benjamin Burchfiel, Shuran Song, et al. Visuomotor policy learning via action diffusion, September 4 2025. US Patent App. 18/594,842

  2. [2]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. In ICRA, pages 13226–13233. IEEE, 2025

  3. [3]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, pages 18995–19012, 2022

  4. [4]

    EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations

    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and Dima Damen. EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations. NeurIPS, 35:13745–13758, 2022

  5. [5]

    Hotr: End-to-end human-object interaction detection with transformers

    Bumsoo Kim, Junhyun Lee, Jaewoo Kang, Eun-Sol Kim, and Hyunwoo J Kim. Hotr: End-to-end human-object interaction detection with transformers. In CVPR, pages 74–83, 2021

  6. [6]

    Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information

    Masato Tamura, Hiroki Ohashi, and Tomoaki Yoshinaga. Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. In CVPR, pages 10410–10419, 2021

  7. [7]

    Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection

    Yue Liao, Aixi Zhang, Miao Lu, Yongliang Wang, Xiaobo Li, and Si Liu. Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In CVPR, pages 20123–20132, 2022

  8. [8]

    Computer vision annotation tool (cvat), 2023

    CVAT.ai Corporation. Computer vision annotation tool (CVAT), 2023

  9. [9]

    The via annotation software for images, audio and video

    Abhishek Dutta and Andrew Zisserman. The via annotation software for images, audio and video. In ACM MM, pages 2276–2279, 2019

  10. [10]

    LIGHTEN: Learning interactions with graph and hierarchical temporal networks for HOI in videos

    Sai Praneeth Reddy Sunkesula, Rishabh Dabral, and Ganesh Ramakrishnan. LIGHTEN: Learning interactions with graph and hierarchical temporal networks for HOI in videos. In ACM MM, pages 691–699, 2020

  11. [11]

    Spatio-temporal interaction graph parsing networks for human-object interaction recognition

    Ning Wang, Guangming Zhu, Liang Zhang, Peiyi Shen, Hongsheng Li, and Cong Hua. Spatio-temporal interaction graph parsing networks for human-object interaction recognition. In ACM MM, pages 4985–4993, 2021

  12. [12]

    Detecting human-object relationships in videos

    Jingwei Ji, Rishi Desai, and Juan Carlos Niebles. Detecting human-object relationships in videos. In ICCV, pages 8106–8116, 2021

  13. [13]

    Video-based human-object interaction detection from tubelet tokens

    Danyang Tu, Wei Sun, Xiongkuo Min, Guangtao Zhai, and Wei Shen. Video-based human-object interaction detection from tubelet tokens. In NeurIPS, 2022

  14. [14]

    ST-HOI: A spatial-temporal baseline for human-object interaction detection in videos

    Meng-Jiun Chiou, Chun-Yu Liao, Li-Wei Wang, Roger Zimmermann, and Jiashi Feng. ST-HOI: A spatial-temporal baseline for human-object interaction detection in videos. In ACM Workshop ICXDAR, pages 9–17, 2021

  15. [15]

    Learning asynchronous and sparse human-object interaction in videos

    Romero Morais, Vuong Le, Svetha Venkatesh, and Truyen Tran. Learning asynchronous and sparse human-object interaction in videos. In CVPR, pages 16041–16050, 2021

  16. [16]

    Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

    Yuanhao Luo, Di Wen, Kunyu Peng, Ruiping Liu, Junwei Zheng, Yufan Chen, Jiale Wei, and Rainer Stiefelhagen. Rethinking video human-object interaction: Set prediction over time for unified detection and anticipation. arXiv preprint arXiv:2604.10397, 2026

  17. [17]

    RoHOI: Robustness benchmark for human-object interaction detection

    Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, and Rainer Stiefelhagen. RoHOI: Robustness benchmark for human-object interaction detection. arXiv preprint arXiv:2507.09111, 2025

  18. [18]

    EgoSound: Benchmarking Sound Understanding in Egocentric Videos

    Bingwen Zhu, Yuqian Fu, Qiaole Dong, Guolei Sun, Tianwen Qian, Yuzheng Wu, Danda Pani Paudel, Xiangyang Xue, and Yanwei Fu. Egosound: Benchmarking sound understanding in egocentric videos. arXiv preprint arXiv:2602.14122, 2026

  19. [19]

    Egonight: Towards egocentric vision understanding at night with a challenging benchmark

    Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, et al. Egonight: Towards egocentric vision understanding at night with a challenging benchmark. arXiv preprint arXiv:2510.06218, 2025

  20. [20]

    Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, 130:33–55, 2022

  21. [21]

    HOI4D: A 4D egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. HOI4D: A 4D egocentric dataset for category-level human-object interaction. In CVPR, pages 20981–20990, 2022

  22. [22]

    ARCTIC: A dataset for dexterous bimanual hand-object manipulation

    Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In CVPR, pages 12943–12954, 2023

  23. [23]

    IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

    Di Wen, Zeyun Zhong, David Schneider, Manuel Zaremski, Linus Kunzmann, Yitian Shi, Ruiping Liu, Yufan Chen, Junwei Zheng, Jiahang Li, Jonas Hemmerich, Qiyi Tong, Patric Grauberger, Arash Ajoudani, Danda Pani Paudel, Sven Matthiesen, Barbara Deml, Jürgen Beyerer, Luc Van Gool, Rainer Stiefelhagen, and Kunyu Peng. IMPACT: A dataset for multi-granularity...

  24. [24]

    Referring atomic video action recognition

    Kunyu Peng, Jianyang Fu, Kailun Yang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen, and Alina Roitberg. Referring atomic video action recognition. In ECCV, 2024

  25. [25]

    HopaDIFF: Holistic-partial aware Fourier conditioned diffusion for referring human action segmentation in multi-person scenarios

    Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, and Rainer Stiefelhagen. HopaDIFF: Holistic-partial aware Fourier conditioned diffusion for referring human action segmentation in multi-person scenarios. NeurIPS, 2025

  26. [26]

    Anticipative feature fusion transformer for multi-modal action anticipation

    Zeyun Zhong, David Schneider, Michael Voit, Rainer Stiefelhagen, and Jürgen Beyerer. Anticipative feature fusion transformer for multi-modal action anticipation. In WACV, pages 6057–6066, 2023

  27. [27]

    Human-object interaction prediction in videos through gaze following

    Zhifan Ni, Esteve Valls Mascaró, Hyemin Ahn, and Dongheui Lee. Human-object interaction prediction in videos through gaze following. Comput. Vis. Image Underst., 233:103741, 2023

  28. [28]

    Anticipating object state changes in long procedural videos

    Victoria Manousaki, Konstantinos Bacharidis, Filippos Gouidis, Konstantinos Papoutsakis, Dimitris Plexousakis, and Antonis A. Argyros. Anticipating object state changes in long procedural videos. arXiv preprint arXiv:2405.12789, 2024

  29. [29]

    Navigating open set scenarios for skeleton-based action recognition

    Kunyu Peng, Cheng Yin, Junwei Zheng, Ruiping Liu, David Schneider, Jiaming Zhang, Kailun Yang, M. Saquib Sarfraz, Rainer Stiefelhagen, and Alina Roitberg. Navigating open set scenarios for skeleton-based action recognition. In AAAI, pages 4487–4496, 2024

  30. [30]

    Skeleton-based human action recognition with noisy labels

    Yi Xu, Kunyu Peng, Di Wen, Ruiping Liu, Junwei Zheng, Yufan Chen, Jiaming Zhang, Alina Roitberg, Kailun Yang, and Rainer Stiefelhagen. Skeleton-based human action recognition with noisy labels. arXiv preprint arXiv:2403.09975, 2024

  31. [31]

    Learning the what and how of annotation in video object segmentation

    Thanos Delatolas, Vicky Kalogeiton, and Dim P. Papadopoulos. Learning the what and how of annotation in video object segmentation. In WACV, pages 6936–6946, 2024

  32. [32]

    Hybrid active learning via deep clustering for video action detection

    Aayush Jung Rana and Yogesh S. Rawat. Hybrid active learning via deep clustering for video action detection. In CVPR, pages 18867–18877, 2023

  33. [33]

    Semi-supervised active learning for video action detection

    Ayush Singh, Aayush Jung Rana, Akash Kumar, Shruti Vyas, and Yogesh Singh Rawat. Semi-supervised active learning for video action detection. In AAAI, pages 4891–4899, 2024

  34. [34]

    Boosting point-supervised temporal action localization through integrating query reformation and optimal transport

    Mengnan Liu, Le Wang, Sanping Zhou, Kun Xia, Xiaolong Sun, and Gang Hua. Boosting point-supervised temporal action localization through integrating query reformation and optimal transport. In CVPR, pages 13865–13875, 2025

  35. [35]

    Live interactive training for video segmentation

    Xinyu Yang, Haozheng Yu, Yihong Sun, Bharath Hariharan, and Jennifer J. Sun. Live interactive training for video segmentation. arXiv preprint arXiv:2603.26929, 2026

  36. [36]

    Adapting the segment anything model during usage in novel situations

    Robin Schön, Julian Lorenz, Katja Ludwig, and Rainer Lienhart. Adapting the segment anything model during usage in novel situations. In CVPRW, pages 3616–3626, 2024

  37. [37]

    MICA: Multi-agent industrial coordination assistant

    Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitian Shi, Jiale Wei, Ruiping Liu, Kailun Yang, and Rainer Stiefelhagen. MICA: Multi-agent industrial coordination assistant. arXiv preprint arXiv:2509.15237, 2025

  38. [38]

    Snap, segment, deploy: A visual data and detection pipeline for wearable industrial assistants

    Di Wen, Junwei Zheng, Ruiping Liu, Yi Xu, Kunyu Peng, and Rainer Stiefelhagen. Snap, segment, deploy: A visual data and detection pipeline for wearable industrial assistants. In IEEE SMC, pages 1270–1276, 2025

  39. [39]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In CVPR, pages 21064–21074, 2022