Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset
Pith reviewed 2026-06-28 09:45 UTC · model grok-4.3
The pith
Body tracking with re-identification and extended temporal memory reduces identity switches by 49 percent relative to a tracking-by-detection baseline in egocentric robot interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that, on an egocentric dataset of close-range human-robot encounters, body tracking augmented with appearance re-identification and extended temporal memory yields a 49 percent reduction in identity switches compared with a standard tracking-by-detection baseline, while the same re-identification step increases identity switches when applied to face detections because of sensitivity to profile angles.
What carries the argument
The custom-annotated egocentric dataset together with the modular pipeline that isolates detection from tracking logic and tests the separate contributions of temporal memory and re-identification.
If this is right
- Increasing temporal memory mitigates identity switches caused by prolonged occlusions but does not resolve complex dynamic events.
- Re-identification substantially improves body tracking stability yet causes facial identity switches to rise.
- The optimized pipeline that combines these elements reduces identity switches by 49 percent over a tracking-by-detection baseline.
- Standard surveillance or driving benchmarks lack the dense, close-quarter occlusions typical of social-robot scenes.
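The interplay of temporal memory and re-identification described in the points above can be sketched in a few lines. This is an illustrative toy, not the paper's pipeline: it matches purely on appearance with a greedy rule and has no motion model or box overlap, and the names `MemoryReIDTracker`, `max_age`, and `min_similarity` are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two appearance embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryReIDTracker:
    """Toy tracker (hypothetical): keeps unseen tracks for max_age frames."""

    def __init__(self, max_age: int, min_similarity: float = 0.8) -> None:
        self.max_age = max_age                # temporal memory, in frames
        self.min_similarity = min_similarity  # ReID acceptance threshold
        self.frame = 0
        self.next_id = 1
        self.tracks: Dict[int, dict] = {}     # id -> embedding and last frame seen

    def update(self, embeddings: List[Sequence[float]]) -> List[int]:
        """Assign one track ID to each detection embedding in this frame."""
        self.frame += 1
        # Forget tracks that have been unseen for more than max_age frames.
        self.tracks = {
            tid: t for tid, t in self.tracks.items()
            if self.frame - t["last_seen"] <= self.max_age
        }
        assigned: List[int] = []
        used = set()
        for emb in embeddings:
            # Greedy choice: the most similar remembered track not yet taken.
            candidates = [
                (cosine(emb, t["emb"]), tid)
                for tid, t in self.tracks.items() if tid not in used
            ]
            score, tid = max(candidates, default=(0.0, None))
            if tid is None or score < self.min_similarity:
                tid = self.next_id            # no acceptable match: new identity
                self.next_id += 1
            self.tracks[tid] = {"emb": emb, "last_seen": self.frame}
            used.add(tid)
            assigned.append(tid)
        return assigned
```

With a short memory, a person who leaves for a few frames comes back under a new ID (an identity switch); with a longer memory the stored embedding is still available and the old ID is reused. The sketch also hints at why an unreliable embedding, such as a face seen in profile, could make this step hurt rather than help.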
Where Pith is reading between the lines
- Robots may maintain steadier engagement by relying on body features rather than faces when users turn or move close.
- Perception models for social interaction require training and test data recorded from the robot's own viewpoint rather than from overhead or side cameras.
- A hybrid cue that switches between or fuses face and body information might avoid the profile-angle problem observed with faces alone.
- The same memory-plus-re-identification pattern could be tested in other close-range egocentric settings such as wearable cameras or handheld devices.
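One way to picture the hybrid cue suggested above is a similarity score that leans on the face only when the head is roughly frontal. The following is a speculative sketch, not something the paper proposes or evaluates; `face_weight`, `fused_similarity`, `max_yaw`, and `max_face_weight` are hypothetical names and values chosen for illustration.

```python
def face_weight(yaw_degrees: float, max_yaw: float = 60.0,
                max_face_weight: float = 0.5) -> float:
    """Trust placed in the face cue: highest when frontal, zero at max_yaw.

    max_yaw and max_face_weight are illustrative assumptions, not tuned values.
    """
    frontalness = max(0.0, 1.0 - abs(yaw_degrees) / max_yaw)
    return max_face_weight * frontalness


def fused_similarity(face_sim: float, body_sim: float,
                     yaw_degrees: float) -> float:
    """Blend face and body similarity, falling back to body at profile angles."""
    w = face_weight(yaw_degrees)
    return w * face_sim + (1.0 - w) * body_sim
```

At a full profile view the face weight drops to zero and the score reduces to the body similarity, which is the behavior the profile-angle observation would call for.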
Load-bearing premise
The interactions recorded in the dataset capture the nonlinear movements, occlusions, and re-entries that occur in ordinary human-robot conversations.
What would settle it
Applying the identical optimized pipeline to an independent egocentric dataset collected from a different robot platform and measuring whether identity switches still fall by roughly 49 percent would directly test the reported gain.
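The replication check described here comes down to a relative-reduction calculation. A minimal sketch follows; the function names and the 10-point tolerance are assumptions made for illustration, and the paper does not define what counts as "roughly 49 percent".

```python
def relative_reduction(baseline_idsw: int, pipeline_idsw: int) -> float:
    """Fraction by which the pipeline lowers IDSW relative to the baseline."""
    if baseline_idsw <= 0:
        raise ValueError("baseline must contain at least one identity switch")
    return (baseline_idsw - pipeline_idsw) / baseline_idsw


def replicates(baseline_idsw: int, pipeline_idsw: int,
               reported: float = 0.49, tolerance: float = 0.10) -> bool:
    """True if the measured reduction is within tolerance of the reported one.

    The tolerance is a hypothetical choice, not a threshold from the paper.
    """
    measured = relative_reduction(baseline_idsw, pipeline_idsw)
    return abs(measured - reported) <= tolerance
```

For example, hypothetical counts of 200 baseline switches and 102 pipeline switches on the new dataset would give a reduction of 0.49 and pass the check, whereas 200 versus 190 would not.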
Original abstract
Meaningful human-robot interaction (HRI) requires a robot to continuously assess user engagement through persistent user tracking. However, state-of-the-art Multi-Object Tracking models are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans moving in unpredictable nonlinear patterns, obstructing each other, or leaving and reentering the scene. These dynamics trigger frequent identity switches (IDSW), causing the robot to lose its footing mid-conversation. To address this, we introduce a focused, custom-annotated egocentric dataset collected via the Furhat robot. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended memory and appearance re-identification (ReID). Results indicate that increasing temporal memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49% compared to a standard tracking-by-detection baseline, effectively mitigating interaction breakdowns. As standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a custom-annotated egocentric dataset collected via the Furhat robot for HRI tracking scenarios involving nonlinear movements, occlusions, and re-entries. It conducts a systematic evaluation comparing face versus body tracking, assessing the effects of extended temporal memory and appearance ReID, and reports that an optimized pipeline reduces IDSW by 49% relative to a tracking-by-detection baseline.
Significance. If the isolation of detection from tracking components is rigorously demonstrated and the dataset dynamics are representative, the results on opposing ReID effects for face versus body tracking would usefully inform HRI perception design. The dataset could address gaps in standard benchmarks that lack dense close-quarter social interactions.
major comments (2)
- [Abstract] The manuscript states quantitative results including a 49% IDSW reduction and a 'systematic evaluation isolating detection errors from tracking logic' but supplies no dataset size, annotation protocol, statistical tests, or error analysis. This prevents verification of the central quantitative claim.
- [Abstract and evaluation] The 49% IDSW reduction is attributed to extended memory and ReID, yet face and body tracking employ distinct detectors with differing error profiles. No explicit ablation is described that holds the detector fixed while varying only the tracking logic (or reports separate detection-only metrics), so the improvement cannot be confidently assigned to the pipeline elements rather than detector differences.
minor comments (1)
- [Abstract] 'reentering' should be hyphenated as 're-entering'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.
- Referee: [Abstract] The manuscript states quantitative results including a 49% IDSW reduction and a 'systematic evaluation isolating detection errors from tracking logic' but supplies no dataset size, annotation protocol, statistical tests, or error analysis. This prevents verification of the central quantitative claim.
  Authors: The abstract is a concise summary subject to strict length limits and therefore omits supporting details that appear in the body of the manuscript. Section 3 fully specifies the dataset size, collection protocol, and annotation procedure; Sections 4 and 5 present the statistical tests, error analysis, and detection-only metrics. To improve standalone readability of the abstract, we will add the dataset size and a one-sentence reference to the evaluation protocol in the revised version.
  Revision: yes
- Referee: [Abstract and evaluation] The 49% IDSW reduction is attributed to extended memory and ReID, yet face and body tracking employ distinct detectors with differing error profiles. No explicit ablation is described that holds the detector fixed while varying only the tracking logic (or reports separate detection-only metrics), so the improvement cannot be confidently assigned to the pipeline elements rather than detector differences.
  Authors: We acknowledge that an ablation keeping the detector identical while varying only the tracker would make the attribution clearer. The manuscript already reports separate detection metrics (precision/recall) for each detector before applying the tracking components; the 49% figure is measured on the combined pipeline. Because face and body tracking are compared as they would actually be deployed in HRI, a full cross-detector swap was not performed. We will revise the evaluation section to state this design choice explicitly, add a dedicated paragraph describing how detection errors are isolated from tracking logic, and include a limited additional ablation if space allows.
  Revision: partial
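The ablation the referee asks for can be laid out as a small configuration grid in which every comparison is made within a single detector. The sketch below only enumerates configurations; the option labels (`"short"`, `"extended"`) and function names are assumptions for illustration, and no results are implied.

```python
from itertools import product
from typing import Dict, List

DETECTORS = ("face", "body")
MEMORY = ("short", "extended")  # hypothetical labels for track-buffer length
REID = (False, True)


def ablation_grid() -> List[dict]:
    """Every combination of detector, temporal memory, and ReID setting."""
    return [
        {"detector": d, "memory": m, "reid": r}
        for d, m, r in product(DETECTORS, MEMORY, REID)
    ]


def group_by_detector(grid: List[dict]) -> Dict[str, List[dict]]:
    """Group runs so that each group varies the tracking logic only."""
    groups: Dict[str, List[dict]] = {}
    for cfg in grid:
        groups.setdefault(cfg["detector"], []).append(cfg)
    return groups
```

Comparing IDSW only among runs in the same group keeps detector error profiles constant, so any difference can be attributed to memory and ReID rather than to the detector.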
Circularity Check
No circularity: purely empirical dataset and ablation study
The paper introduces a new egocentric dataset and reports empirical results from tracking experiments (face vs. body, memory length, ReID). The central 49% IDSW reduction is presented as an observed outcome of the optimized pipeline versus baseline on this dataset. No equations, parameter fits, derivations, or self-citations appear in the provided text that would reduce any claim to its own inputs by construction. The evaluation is measured against an external baseline (standard tracking-by-detection) and does not rely on load-bearing prior results from the same authors. This is the expected finding for an empirical robotics dataset paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard multi-object tracking metrics such as IDSW are appropriate measures of HRI engagement failures.
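For readers unfamiliar with the metric behind this assumption, an identity switch is conventionally counted whenever a ground-truth person is matched to a different tracker ID than the one last assigned to them. The sketch below follows that common definition under simplifying assumptions: matching between ground truth and tracker output is taken as already done, and the function name and input layout are hypothetical rather than taken from the paper or any evaluation toolkit.

```python
from typing import Dict, Hashable, Iterable


def count_id_switches(frames: Iterable[Dict[Hashable, Hashable]]) -> int:
    """Count identity switches over a sequence of per-frame matches.

    Each frame maps a ground-truth identity to the tracker ID matched to it
    in that frame; people who were missed in a frame are simply absent.
    """
    last_track: Dict[Hashable, Hashable] = {}  # last tracker ID seen per person
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            previous = last_track.get(gt_id)
            # A switch is a change relative to the last known assignment,
            # even if the person was unmatched for some frames in between.
            if previous is not None and previous != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches
```

Whether each such switch corresponds to an engagement failure from the user's point of view is exactly what the assumption above takes for granted.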