Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

Gabriel Skantze; Jessica Wenninger

arXiv: 2606.03694 · v2 · pith:HY3GIGY5new · submitted 2026-06-02 · cs.RO · cs.CV · cs.HC

Pith reviewed 2026-06-28 09:45 UTC · model grok-4.3

keywords: egocentric tracking, human-robot interaction, multi-object tracking, identity switches, face tracking, body tracking, re-identification, temporal memory

The pith

Body tracking with re-identification reduces identity switches by 49 percent in egocentric robot interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard multi-object tracking models, built for surveillance or driving, produce frequent identity switches when a social robot must follow users who move unpredictably, block one another, or leave and return to view. It supplies a new dataset recorded from the robot's own camera and runs a controlled comparison of face-based versus body-based detection, longer temporal memory, and appearance re-identification. Longer memory reduces breaks from occlusions but leaves complex motion events unsolved. Re-identification improves body tracks yet raises face switches because profile views change appearance sharply. The best combination of these components cuts identity switches by 49 percent relative to a plain tracking-by-detection baseline.

Core claim

The paper claims that, on an egocentric dataset of close-range human-robot encounters, body tracking augmented with appearance re-identification and extended temporal memory yields a 49 percent reduction in identity switches compared with a standard tracking-by-detection baseline, while the same re-identification step increases identity switches when applied to face detections because of sensitivity to profile angles.
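To make the identity-switch metric concrete, the sketch below counts switches once each ground-truth person has been matched to a predicted track on every frame. It is a simplified reading of the usual multi-object tracking definition, not the paper's evaluation code; the function name `count_id_switches` and the per-frame dictionary format are assumptions made for illustration.

```python
def count_id_switches(frames):
    """Count identity switches over a sequence of per-frame matches.

    `frames` is a list of dicts mapping ground-truth person ID -> predicted
    track ID for that frame (hypothetical format; unmatched people are
    simply absent). A switch is counted when a person is matched to a
    different track ID than the last one they were matched to.
    """
    last_track = {}   # ground-truth ID -> most recent predicted track ID
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            previous = last_track.get(gt_id)
            if previous is not None and previous != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


# Toy sequence: person "A" is lost during an occlusion and returns as track 7.
frames = [
    {"A": 1, "B": 2},
    {"A": 1, "B": 2},
    {"B": 2},          # "A" occluded, no match this frame
    {"A": 7, "B": 2},  # "A" re-enters under a new track ID
]
print(count_id_switches(frames))
```

In this toy sequence only person "A" changes track ID, so the count is one. A full evaluation would also need the matching step that produces these per-frame assignments, which is omitted here.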

What carries the argument

The custom-annotated egocentric dataset together with the modular pipeline that isolates detection from tracking logic and tests the separate contributions of temporal memory and re-identification.

If this is right

Increasing temporal memory reduces identity breaks caused by prolonged occlusions but does not resolve complex dynamic events.
Re-identification substantially improves body tracking stability yet causes facial identity switches to rise.
The optimized pipeline that combines these elements reduces identity switches by 49 percent over a tracking-by-detection baseline.
Standard surveillance or driving benchmarks lack the dense, close-quarter occlusions typical of social-robot scenes.
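The temporal-memory effect in the list above can be illustrated with a deliberately small model: a single person, a detector that either sees them or not, and a track that is kept alive for at most `max_age` missed frames. This is a sketch under those assumptions, not the tracker used in the paper; `assign_ids` and `max_age` are hypothetical names.

```python
def assign_ids(visible, max_age):
    """Return the track ID reported on each frame (None when unseen).

    `visible` is a list of booleans, one per frame, saying whether the
    single person in this toy scene was detected. A lost track is
    remembered for up to `max_age` consecutive missed frames; after
    that, a returning person receives a fresh ID.
    """
    ids = []
    next_id = 0
    current = None   # ID of the live or remembered track
    missed = 0       # consecutive frames without a detection
    for seen in visible:
        if seen:
            if current is None:
                current = next_id
                next_id += 1
            missed = 0
            ids.append(current)
        else:
            if current is not None:
                missed += 1
                if missed > max_age:
                    current = None   # memory expired, track is dropped
            ids.append(None)
    return ids


# Three visible frames, a ten-frame occlusion, then three visible frames.
visible = [True] * 3 + [False] * 10 + [True] * 3
short_memory = {i for i in assign_ids(visible, max_age=5) if i is not None}
long_memory = {i for i in assign_ids(visible, max_age=30) if i is not None}
print(len(short_memory), len(long_memory))
```

With the short memory the person comes back under a second ID; with the long memory the original ID survives the gap. Because the model has only one person, it cannot produce the swaps that occur when two people cross paths, which is consistent with the observation that memory alone leaves complex dynamic events unsolved.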

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robots may maintain steadier engagement by relying on body features rather than faces when users turn or move close.
Perception models for social interaction require training and test data recorded from the robot's own viewpoint rather than from overhead or side cameras.
A hybrid cue that switches between or fuses face and body information might avoid the profile-angle problem observed with faces alone.
The same memory-plus-re-identification pattern could be tested in other close-range egocentric settings such as wearable cameras or handheld devices.

Load-bearing premise

The interactions recorded in the dataset capture the nonlinear movements, occlusions, and re-entries that occur in ordinary human-robot conversations.

What would settle it

Applying the identical optimized pipeline to an independent egocentric dataset collected from a different robot platform and measuring whether identity switches still fall by roughly 49 percent would directly test the reported gain.
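For such a replication, the quantity to compare is the relative reduction in identity switches. The sketch below shows the arithmetic; the counts are hypothetical placeholders chosen only to illustrate the formula and are not taken from the paper.

```python
def relative_idsw_reduction(baseline_idsw, pipeline_idsw):
    """Fraction by which the pipeline lowers IDSW relative to the baseline."""
    if baseline_idsw <= 0:
        raise ValueError("baseline must have at least one identity switch")
    return (baseline_idsw - pipeline_idsw) / baseline_idsw


# Hypothetical counts, for illustration only.
baseline_idsw = 200
pipeline_idsw = 102
print(f"{relative_idsw_reduction(baseline_idsw, pipeline_idsw):.0%}")
```

A reduction near 0.49 on an independent dataset would support the reported gain; a much smaller value would suggest the result depends on the original recordings.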

Figures

Figures reproduced from arXiv: 2606.03694 by Gabriel Skantze, Jessica Wenninger.

**Figure 1.** The Egocentric HRI Tracking Challenge. Left: The experimental setup with the Furhat robot in a real-world office environment. Right: A representative frame from the robot’s egocentric perspective. The scene highlights the difficulty of maintaining consistent identities for multiple actors despite dynamic background motion and severe occlusions.

**Figure 2.** Mitigation of qualitative failure modes for body tracking (see …)

**Figure 3.** Impact of Memory and Appearance on Tracking Stability.

**Figure 4.** Qualitative comparison of tracking stability during a complex “U-Turn” occlusion event.

**Figure 5.** Tracking during an “ID Takeover” event. Top Row (B-GT-BT30): Moving Bystander occludes seated overhearer at the back. The heavy body bounding box overlap causes the Kalman filter to swap their identities. Bottom Row (F-GT-BT-30): Smaller, spatially distinct face boxes minimize IoU overlap, allowing the tracker to correctly maintain individual identities through the dynamic occlusion.
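The Figure 5 caption attributes the body-track swap to heavy bounding-box overlap and the face-track success to low IoU. The sketch below computes IoU for two made-up pairs of boxes to show the geometric difference; the coordinates are hypothetical and do not come from the dataset.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical pixel coordinates for two people passing close together.
body_seated = (100, 50, 300, 450)
body_bystander = (180, 60, 380, 460)
face_seated = (170, 60, 230, 130)
face_bystander = (250, 70, 310, 140)

print(round(iou(body_seated, body_bystander), 2))
print(iou(face_seated, face_bystander))
```

In this toy layout the two body boxes overlap substantially while the two face boxes do not touch, so an association step based on IoU has an ambiguous choice for bodies and an unambiguous one for faces.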

Original abstract

Meaningful human-robot interaction (HRI) requires a robot to continuously assess user engagement through persistent user tracking. However, state-of-the-art Multi-Object Tracking models are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans moving in unpredictable nonlinear patterns, obstructing each other, or leaving and reentering the scene. These dynamics trigger frequent identity switches (IDSW), causing the robot to lose its footing mid-conversation. To address this, we introduce a focused, custom-annotated egocentric dataset collected via the Furhat robot. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended memory and appearance re-identification (ReID). Results indicate that increasing temporal memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49% compared to a standard tracking-by-detection baseline, effectively mitigating interaction breakdowns. As standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
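One way to picture the profile-angle sensitivity mentioned in the abstract is an appearance gate based on cosine distance between embeddings. The sketch below is an illustration under stated assumptions, not the paper's ReID model: the three-dimensional vectors are toy values, and `reid_match` with its `max_distance` threshold is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def reid_match(track_embedding, detection_embedding, max_distance=0.3):
    """Accept the detection for this track if the appearance is close enough."""
    return cosine_distance(track_embedding, detection_embedding) <= max_distance


# Toy embeddings (hypothetical): the same face seen frontally twice,
# then in profile, where the appearance vector shifts sharply.
frontal = [0.9, 0.1, 0.2]
frontal_later = [0.85, 0.15, 0.25]
profile = [0.2, 0.9, 0.3]

print(reid_match(frontal, frontal_later))
print(reid_match(frontal, profile))
```

With these toy values the later frontal view passes the gate and the profile view is rejected. A tracker that trusts such a gate would start a new identity for the profile view, which is the kind of facial identity switch the abstract describes.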

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New egocentric HRI dataset with face-body comparison and ReID effects is the real addition, but the 49% IDSW reduction lacks the controls needed to attribute it cleanly to the pipeline.

The paper's main points are the introduction of a custom-annotated egocentric dataset from the Furhat robot and the observation that ReID improves body tracking stability while increasing facial IDSW due to profile sensitivity.

It does a solid job identifying HRI-specific problems like nonlinear paths, mutual occlusions, and re-entries that standard MOT benchmarks ignore. Collecting data directly from a social robot and running the face-versus-body comparison plus memory and ReID variants gives a focused, practical angle that prior work does not cover.

The soft spot is the central claim. The abstract states a 49% IDSW drop from the optimized pipeline versus a tracking-by-detection baseline and mentions isolating detection errors from tracking logic, yet supplies no dataset size, annotation protocol, detection-only metrics, or explicit ablation that holds the detector fixed. The stress-test concern holds up on the available text: without those controls it is difficult to assign the gain to extended memory and ReID rather than differences in the underlying face and body detectors. The opposing ReID effects are interesting but rest on the same unshown isolation.

This work is for researchers building or evaluating perception stacks for social robots in close quarters. Someone already working on egocentric tracking or ReID in HRI would find the dataset and the face-body differential useful as a reference point.

It deserves peer review because the dataset is new and the reported differential is specific, even though the evaluation details will need tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces a custom-annotated egocentric dataset collected via the Furhat robot for HRI tracking scenarios involving nonlinear movements, occlusions, and re-entries. It conducts a systematic evaluation comparing face versus body tracking, assessing the effects of extended temporal memory and appearance ReID, and reports that an optimized pipeline reduces IDSW by 49% relative to a tracking-by-detection baseline.

Significance. If the isolation of detection from tracking components is rigorously demonstrated and the dataset dynamics are representative, the results on opposing ReID effects for face versus body tracking would usefully inform HRI perception design. The dataset could address gaps in standard benchmarks that lack dense close-quarter social interactions.

major comments (2)

[Abstract] The manuscript states quantitative results including a 49% IDSW reduction and a 'systematic evaluation isolating detection errors from tracking logic' but supplies no dataset size, annotation protocol, statistical tests, or error analysis. This prevents verification of the central quantitative claim.
[Abstract and evaluation] The 49% IDSW reduction is attributed to extended memory and ReID, yet face and body tracking employ distinct detectors with differing error profiles. No explicit ablation is described that holds the detector fixed while varying only the tracking logic (or reports separate detection-only metrics), so the improvement cannot be confidently assigned to the pipeline elements rather than detector differences.

minor comments (1)

[Abstract] 'reentering' should be hyphenated as 're-entering'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.

Referee: [Abstract] The manuscript states quantitative results including a 49% IDSW reduction and a 'systematic evaluation isolating detection errors from tracking logic' but supplies no dataset size, annotation protocol, statistical tests, or error analysis. This prevents verification of the central quantitative claim.

Authors: The abstract is a concise summary subject to strict length limits and therefore omits supporting details that appear in the body of the manuscript. Section 3 fully specifies the dataset size, collection protocol, and annotation procedure; Sections 4 and 5 present the statistical tests, error analysis, and detection-only metrics. To improve standalone readability of the abstract, we will add the dataset size and a one-sentence reference to the evaluation protocol in the revised version.

Revision: yes

Referee: [Abstract and evaluation] The 49% IDSW reduction is attributed to extended memory and ReID, yet face and body tracking employ distinct detectors with differing error profiles. No explicit ablation is described that holds the detector fixed while varying only the tracking logic (or reports separate detection-only metrics), so the improvement cannot be confidently assigned to the pipeline elements rather than detector differences.

Authors: We acknowledge that an ablation keeping the detector identical while varying only the tracker would make the attribution clearer. The manuscript already reports separate detection metrics (precision/recall) for each detector before applying the tracking components; the 49% figure is measured on the combined pipeline. Because face and body tracking are compared as they would actually be deployed in HRI, a full cross-detector swap was not performed. We will revise the evaluation section to state this design choice explicitly, add a dedicated paragraph describing how detection errors are isolated from tracking logic, and include a limited additional ablation if space allows.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and ablation study

The paper introduces a new egocentric dataset and reports empirical results from tracking experiments (face vs. body, memory length, ReID). The central 49% IDSW reduction is presented as an observed outcome of the optimized pipeline versus baseline on this dataset. No equations, parameter fits, derivations, or self-citations appear in the provided text that would reduce any claim to its own inputs by construction. The evaluation is self-contained against external benchmarks (standard tracking-by-detection) and does not rely on load-bearing prior results from the same authors. This is the expected finding for an empirical robotics dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that standard MOT metrics (IDSW) meaningfully capture interaction breakdowns and that the collected dataset adequately represents the stated egocentric challenges.

axioms (1)

Domain assumption: Standard multi-object tracking metrics such as IDSW are appropriate measures for HRI engagement failures.
Invoked when claiming the 49% reduction mitigates interaction breakdowns.

[1] [1]

From the Definition to the Automatic Assessment of Engagement in Human–Robot Interaction: A Systematic Review,

A. Sorrentino, L. Fiorini, and F. Cavallo, “From the Definition to the Automatic Assessment of Engagement in Human–Robot Interaction: A Systematic Review,”International Journal of Social Robotics, vol. 16, no. 7, pp. 1641–1663, July 2024. [Online]. Available: https://doi.org/10.1007/s12369-024-01146-w

work page doi:10.1007/s12369-024-01146-w 2024

[2] [2]

Are You Still With Me? Continuous Engagement Assessment From a Robot’s Point of View,

F. Del Duchetto, P. Baxter, and M. Hanheide, “Are You Still With Me? Continuous Engagement Assessment From a Robot’s Point of View,”Frontiers in Robotics and AI, vol. 7, Sept. 2020. [Online]. Available: https://www.frontiersin.org/journals/robotics-and-ai/articles /10.3389/frobt.2020.00116/full

work page doi:10.3389/frobt.2020.00116/full 2020

[3] [3]

Footing in human-robot conversations: how robots might shape participant roles using gaze cues,

B. Mutlu, T. Shiwa, T. Kanda, H. Ishiguro, and N. Hagita, “Footing in human-robot conversations: how robots might shape participant roles using gaze cues,” inProceedings of the 4th ACM/IEEE international conference on Human robot interaction, ser. HRI ’09. New York, NY , USA: Association for Computing Machinery, Mar. 2009, pp. 61–68. [Online]. Available: ...

work page doi:10.1145/1514095.1514109 2009

[4] [4]

Online Multi-Object Tracking Based on Record Confidence and Hierarchical Association for Cyber-Physical Social Intelligence,

J. Yang, D. Feng, Y . Gao, and C. Liu, “Online Multi-Object Tracking Based on Record Confidence and Hierarchical Association for Cyber-Physical Social Intelligence,”Big Data Mining and Analytics, vol. 8, no. 4, pp. 851–866, Aug. 2025. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/11002437

work page arXiv 2025

[5] [5]

A Taxonomy of Social Errors in Human-Robot Interaction,

L. Tian and S. Oviatt, “A Taxonomy of Social Errors in Human-Robot Interaction,”J. Hum.-Robot Interact., vol. 10, no. 2, pp. 13:1–13:32, Feb. 2021. [Online]. Available: https://dl.acm.org/doi/10.1145/34397 20

work page doi:10.1145/34397 2021

[6] [6]

REGROUP: A Robot-Centric Group Detection and Tracking System,

A. Taylor and L. D. Riek, “REGROUP: A Robot-Centric Group Detection and Tracking System,” in2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI). Sapporo, Japan: IEEE, Mar. 2022, pp. 412–421. [Online]. Available: https://ieeexplore.ieee.org/document/9889634/

work page arXiv 2022

[7] [7]

BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning,

F. Yu, H. Chen, X. Wang, W. Xian, Y . Chen, F. Liu, V . Madhavan, and T. Darrell, “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, W A, USA: IEEE, June 2020, pp. 2633–2642. [Online]. Available: https://ieeexplore.ieee.org/document/9156329/

work page arXiv 2020

[8] [8]

T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction,

B. Stoler, M. Jana, S. Hwang, and J. Oh, “T2FPV: Dataset and Method for Correcting First-Person View Errors in Pedestrian Trajectory Prediction,” Mar. 2023, arXiv:2209.11294 [cs]. [Online]. Available: http://arxiv.org/abs/2209.11294

work page arXiv 2023

[9] Y. Wang, J. Shen, S. Petridis, and M. Pantic, “A real-time and unsupervised face re-identification system for human-robot interaction,” Pattern Recognition Letters, vol. 128, pp. 559–568, Dec. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865518301296

[11] A. Khalifa, A. A. Abdelrahman, D. Strazdas, J. Hintz, T. Hempel, and A. Al-Hamadi, “Face Recognition and Tracking Framework for Human–Robot Interaction,” Applied Sciences, vol. 12, no. 11, May. [Online]. Available: https://www.mdpi.com/2076-3417/12/11/5568

[13] A. Brown, V. Kalogeiton, and A. Zisserman, “Face, Body, Voice: Video Person-Clustering With Multiple Modalities,” 2021, pp. 3184–. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2021W/CVEU/html/Brown_Face_Body_Voice_Video_Person-Clustering_With_Multiple_Modalities_ICCVW_2021_paper.html

[15] J. Kim, C.-Y. Ju, G.-W. Kim, and D.-H. Lee, “BoT-FaceSORT: Bag-of-Tricks for Robust Multi-face Tracking in Unconstrained Videos,” in Computer Vision – ACCV 2024, ser. Lecture Notes in Computer Science, M. Cho, I. Laptev, D. Tran, A. Yao, and H. Zha, Eds. Singapore: Springer Nature Singapore, 2025, vol. 15473, pp. 278–294. [Online]. Available: htt...

[16] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP), Sept. 2016, pp. 3464–3468, ISSN: 2381-8549. [Online]. Available: https://ieeexplore.ieee.org/document/7533003/

[17] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “ByteTrack: Multi-object Tracking by Associating Every Detection Box,” in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Cham: Springer Nature Switzerland, 2022, pp. 1–21. [Online]. Available: https://doi.org/10.1007/97...

[18] N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “BoT-SORT: Robust Associations Multi-Pedestrian Tracking,” July 2022, arXiv:2206.14651 [cs]. [Online]. Available: http://arxiv.org/abs/2206.14651

[19] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, “RGB-D-based human motion recognition with deep learning: A survey,” Computer Vision and Image Understanding, vol. 171, pp. 118–139, June 2018. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S1077314218300663

[20] R. Han, W. Feng, Y. Zhang, J. Zhao, and S. Wang, “Multiple Human Association and Tracking From Egocentric and Complementary Top Views,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5225–5242, Sept. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9394804/

[21] Z. Lin, S. Ji, W. Wang, M. Qin, R. Yang, M. Wan, J. Gu, T. Li, and C. Zhang, “A Joint Tracking System: Robot is Online to Access Surveillance Views,” in 2023 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dec. 2023, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/document/10354902/

[22] F. Mohsen and A. Safa, “Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization,” Dec. 2025, arXiv:2512.17958 [cs]. [Online]. Available: http://arxiv.org/abs/2512.17958

[23] Y. Su, C. Cun, H. Xia, Y. Feng, B. He, Q. Sun, J. Zhong, and Z. Li, “Q-Tracking: A Robust Visual Human Following for Quadruped Robots in Dynamic Environments,” in 2025 International Conference on Advanced Robotics and Mechatronics (ICARM), Aug. 2025, pp. 1–6, ISSN: 2993-4990. [Online]. Available: https://ieeexplore.ieee.org/document/11293732/

[24] R. Martín-Martín, M. Patel, H. Rezatofighi, A. Shenoi, J. Gwak, E. Frankel, A. Sadeghian, and S. Savarese, “JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6748–6765, June 2023. [Online]. Available: https://ieeexplore...

[25] L. Scofano, A. Sampieri, T. Campari, V. Sacco, I. Spinelli, L. Ballan, and F. Galasso, “Following the Human Thread in Social Navigation,” Feb. 2025, arXiv:2404.11327 [cs]. [Online]. Available: http://arxiv.org/abs/2404.11327

[26] P. Dendorfer, A. Ošep, A. Milan, K. Schindler, D. Cremers, I. Reid, S. Roth, and L. Leal-Taixé, “MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking,” International Journal of Computer Vision, vol. 129, no. 4, pp. 845–881, Apr. 2021. [Online]. Available: https://doi.org/10.1007/s11263-020-01393-0

[27] H. Ye, Y. Zhan, W. Situ, G. Chen, J. Yu, Z. Zhao, K. Cai, A. Ajoudani, and H. Zhang, “TPT-Bench: A Large-Scale, Long-Term and Robot-Egocentric Dataset for Benchmarking Target Person Tracking,” July 2025, arXiv:2505.07446 [cs]. [Online]. Available: http://arxiv.org/abs/2505.07446

[28] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé, “MOT20: A benchmark for multi object tracking in crowded scenes,” Mar. 2020, arXiv:2003.09003 [cs]. [Online]. Available: http://arxiv.org/abs/2003.09003

[29] P. Sun, J. Cao, Y. Jiang, Z. Yuan, S. Bai, K. Kitani, and P. Luo, “DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE, June 2022, pp. 20 961–20 970. [Online]. Available: https://ieeexplore.ieee.org/document/9879192/

[30] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun, “CrowdHuman: A Benchmark for Detecting Human in a Crowd,” Apr. 2018, arXiv:1805.00123 [cs]. [Online]. Available: http://arxiv.org/abs/1805.00123

[31] S. Al Moubayed, J. Beskow, G. Skantze, and B. Granström, “Furhat: A Back-Projected Human-Like Robot Head for Multiparty Human-Machine Interaction,” in Cognitive Behavioural Systems, A. Esposito, A. M. Esposito, A. Vinciarelli, R. Hoffmann, and V. C. Müller, Eds. Berlin, Heidelberg: Springer, 2012, pp. 114–130. [Online]. Available: https://doi.org/10....

[32] CVAT.ai Corporation, “Computer Vision Annotation Tool (CVAT),” [Online]. Available: https://github.com/cvat-ai/cvat

[34] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO Series in 2021,” Aug. 2021, arXiv:2107.08430 [cs]. [Online]. Available: http://arxiv.org/abs/2107.08430

[35] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, “RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, June 2020, pp. 5202–5211. [Online]. Available: https://ieeexplore.ieee.org/document/9157330/

[36] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Detection: An Evaluation of the State of the Art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, Apr. 2012. [Online]. Available: https://ieeexplore.ieee.org/document/5975165/

[37] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A Face Detection Benchmark,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, June 2016, pp. 5525–5533. [Online]. Available: https://ieeexplore.ieee.org/document/7780965/

[38] J. Luiten, A. Ošep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “HOTA: A Higher Order Metric for Evaluating Multi-object Tracking,” International Journal of Computer Vision, vol. 129, no. 2, pp. 548–578, Feb. 2021. [Online]. Available: https://doi.org/10.1007/s11263-020-01375-2