Detecting Precise Hand Touch Moments in Egocentric Video
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced "high-see") that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft labels that emphasize hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
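The two-frame tolerance criterion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matching or averaging protocol, so the greedy score-ordered matching and the non-interpolated average precision below are assumptions, and the names `match_predictions` and `average_precision` are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple


def match_predictions(
    preds: List[Tuple[int, float]], gt_frames: List[int], tol: int = 2
) -> List[Tuple[float, bool]]:
    """Label each (frame, score) prediction as a hit or a miss.

    Assumption: predictions are processed in descending score order, and each
    ground-truth contact frame can be claimed by at most one prediction.
    """
    claimed = set()
    labelled = []
    for frame, score in sorted(preds, key=lambda p: -p[1]):
        best = None
        for i, g in enumerate(gt_frames):
            if i in claimed:
                continue
            dist = abs(frame - g)
            # A hit must lie within `tol` frames; prefer the nearest free moment.
            if dist <= tol and (best is None or dist < abs(frame - gt_frames[best])):
                best = i
        if best is not None:
            claimed.add(best)
        labelled.append((score, best is not None))
    return labelled


def average_precision(
    preds: List[Tuple[int, float]], gt_frames: List[int], tol: int = 2
) -> float:
    """Non-interpolated AP: mean of precision values at each true positive."""
    if not gt_frames:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, is_hit) in enumerate(match_predictions(preds, gt_frames, tol), start=1):
        if is_hit:
            hits += 1
            total += hits / rank
    return total / len(gt_frames)


# Toy example (made-up numbers): three annotated contact moments, four predictions.
gt = [10, 50, 90]
preds = [(11, 0.9), (47, 0.8), (52, 0.7), (200, 0.4)]
ap_strict = average_precision(preds, gt, tol=2)
ap_loose = average_precision(preds, gt, tol=3)
```

In this toy case the prediction at frame 47 is three frames from the annotated moment at frame 50, so it counts as a miss at a tolerance of two frames but as a hit at three, which shows how sensitive the score is to the tolerance.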
fields
cs.CV (1)
years
2026 (1)
verdicts
ACCEPT (1)
representative citing papers
- Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance
An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concluding the approach does not enable reliable keystroke reconstruction outside contro