
arxiv: 2604.09658 · v1 · submitted 2026-03-30 · 💻 cs.HC · cs.CV

TinyGaze: Lightweight Gaze-Gesture Recognition on Commodity Mobile Devices

Pith reviewed 2026-05-14 22:01 UTC · model grok-4.3

classification 💻 cs.HC cs.CV
keywords gaze gesture recognition · mobile devices · lightweight models · head pose tracking · on-device inference · human-computer interaction · ARKit

The pith

A compact 46k-parameter model recognizes gaze gestures on mobile devices with a Macro F1 of 0.960 in a four-participant pilot, using ARKit head and eye data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an end-to-end system for gaze gestures as hands-free input on ordinary phones. It pairs standard ARKit head and eye tracking with a scaffolded guidance-to-recall protocol, grounded in learning theory, that helps users acquire and recall five gestures. In a controlled single-session pilot with four participants (240 trials), the TinyHAR model reaches a Macro F1 of 0.960 on 5-way gesture recognition and 0.997 on 4-way user identification while using only 46k parameters. A modality analysis indicates that head pose dynamics are highly informative for these gestures, which makes lightweight on-device processing plausible. The work therefore points toward mobile interfaces that avoid heavy computation or external servers, although the small sample limits how far the result generalizes.

Core claim

The authors report that their compact time-series model TinyHAR, with only 46k parameters, attains Macro F1 scores of 0.960 for 5-way gaze gesture recognition and 0.997 for 4-way user identification when trained on ARKit head and eye transforms under a scaffolded guidance-to-recall protocol, matching or exceeding deeper baselines while depending primarily on head pose dynamics.
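
Macro F1, the metric behind both headline numbers, is the unweighted mean of the per-class F1 scores, so every gesture (or user) counts equally regardless of how many trials it has. A minimal sketch of the computation follows; the integer class labels and the toy predictions are placeholders for illustration, not data from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 3-class example (placeholder labels, not the paper's gestures).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
```

Because the mean is unweighted, a single poorly recognized gesture lowers the score as much as any other class would, which is why the metric suits a 5-way benchmark.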

What carries the argument

TinyHAR, a compact time-series model that processes sequences of head and eye transforms from ARKit to classify gaze gestures and identify users.

Load-bearing premise

The accuracy measured in one controlled lab session with four participants will hold for diverse users and everyday mobile environments.

What would settle it

A follow-up experiment with twenty or more participants across multiple sessions in uncontrolled settings such as walking or varying lighting, measuring whether Macro F1 falls below 0.85 on the same gestures.

Figures

Figures reproduced from arXiv: 2604.09658 by Fergus Buchanan, Hyochan Cho, Juan Ye, Shijing He, Xinya Gong, Yaxiong Lei, Yuheng Wang.

Figure 1: TinyGaze overview. TinyGaze logs ARKit head and eye pose time series and represents each gaze gesture as a spatio…
Figure 2: Gesture set. Five gaze-gesture patterns with increas…
Original abstract

Gaze gestures can provide hands-free input on mobile devices, but practical use requires (i) gestures users can learn and recall and (ii) recognition models that are efficient enough for on-device deployment. We present an end-to-end pipeline using commodity ARKit head/eye transforms and a scaffolded guidance-to-recall protocol grounded in learning theory. In a pilot feasibility study (N=4 participants; 240 trials; controlled single-session setting), we benchmark a compact time-series model (TinyHAR) against deeper baselines (DeepConvLSTM, SA-HAR) on 5-way gesture recognition and 4-way user identification. TinyHAR achieves strong performance in this pilot benchmark (Macro F1 = 0.960 for gesture recognition; Macro F1 = 0.997 for user identification) while using only 46k parameters. A modality analysis further indicates that head pose dynamics are highly informative for mobile gaze gestures, highlighting embodied head-eye coordination as a key design consideration. Although the small sample size and controlled setting limit generalizability, these results indicate a potential direction for further investigation into on-device gaze gesture recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents TinyGaze, an end-to-end pipeline for gaze-gesture recognition on commodity mobile devices that combines ARKit head/eye transforms with a scaffolded guidance-to-recall protocol. In a pilot feasibility study (N=4 participants, 240 trials, single controlled session), a compact time-series model (TinyHAR, 46k parameters) is benchmarked against DeepConvLSTM and SA-HAR on 5-way gesture recognition and 4-way user identification, reporting Macro F1 scores of 0.960 and 0.997 respectively. A modality analysis highlights the informativeness of head-pose dynamics for these gestures.

Significance. If the pilot performance generalizes, the work would demonstrate that very small models can support accurate on-device gaze gestures, lowering barriers to hands-free mobile input and emphasizing embodied head-eye coordination as a design factor. The current evidence, however, is confined to a narrow controlled setting, so the significance remains prospective pending larger-scale validation.

major comments (2)
  1. [Pilot feasibility study] Pilot study description: the Macro F1 scores of 0.960 (gesture) and 0.997 (user ID) are obtained from ~60 trials per participant in a single session with no reported cross-subject validation, error bars, or statistical significance tests; this leaves open the possibility that the model fits participant-specific idiosyncrasies rather than transferable signals, directly affecting the claim that the results indicate a viable direction for commodity deployment.
  2. [Modality analysis] Modality analysis: the statement that head-pose dynamics are 'highly informative' is presented without accompanying ablation results, feature-importance metrics, or quantitative comparison of head-pose-only versus eye-only inputs, making it impossible to assess how much of the reported performance depends on this modality.
minor comments (2)
  1. [Abstract] Clarify the exact relationship between the system name TinyGaze and the model name TinyHAR; the abstract uses both without explicit mapping.
  2. [Methods] Provide the precise list of the five gestures and the details of the scaffolded guidance-to-recall protocol so that the study can be replicated.
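
The cross-subject validation raised in major comment 1 can be sketched as a leave-one-participant-out split: each fold trains on three participants and tests on the fourth, so a model cannot score well by memorizing one person's idiosyncrasies. The sketch below assumes exactly 60 trials per participant (the report says roughly 60) and uses placeholder participant IDs; the model fitting step is left as a comment.

```python
from statistics import mean, stdev


def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_idx, test_idx), holding out one participant per fold."""
    for held_out in sorted(set(participant_ids)):
        train_idx = [i for i, p in enumerate(participant_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train_idx, test_idx


def summarise(fold_scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    return mean(fold_scores), stdev(fold_scores)


# Assumed layout: 4 participants x 60 trials = 240 trials (IDs are placeholders).
participants = [f"P{k}" for k in range(1, 5) for _ in range(60)]
for held_out, train_idx, test_idx in leave_one_participant_out(participants):
    # A real run would fit the model on train_idx and score Macro F1 on test_idx.
    print(held_out, len(train_idx), len(test_idx))
```

This split only makes sense for the 5-way gesture task. For 4-way user identification, holding a participant out removes that class from training entirely, so a within-participant split across trials would be needed instead.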

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our pilot feasibility study. We address each major comment below, clarifying the scope of our claims and outlining revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Pilot feasibility study] Pilot study description: the Macro F1 scores of 0.960 (gesture) and 0.997 (user ID) are obtained from ~60 trials per participant in a single session with no reported cross-subject validation, error bars, or statistical significance tests; this leaves open the possibility that the model fits participant-specific idiosyncrasies rather than transferable signals, directly affecting the claim that the results indicate a viable direction for commodity deployment.

    Authors: We agree that the current presentation lacks cross-subject validation, error bars, and statistical tests, which is a valid concern for assessing transferability in a small-N pilot. The manuscript already frames the work as a controlled feasibility study with explicit limitations on generalizability. In the revision we will add leave-one-participant-out cross-validation, report standard deviations across folds, and include statistical comparisons (e.g., McNemar tests or paired Wilcoxon tests) between TinyHAR and the baselines to better demonstrate whether performance reflects transferable signals rather than idiosyncrasies. revision: yes

  2. Referee: [Modality analysis] Modality analysis: the statement that head-pose dynamics are 'highly informative' is presented without accompanying ablation results, feature-importance metrics, or quantitative comparison of head-pose-only versus eye-only inputs, making it impossible to assess how much of the reported performance depends on this modality.

    Authors: We acknowledge that the modality analysis section currently lacks explicit ablation studies or quantitative comparisons. In the revised manuscript we will expand this section to report Macro F1 scores for head-pose-only, eye-only, and combined inputs, along with a simple feature-importance ranking derived from the time-series model, to provide direct quantitative support for the informativeness of head-pose dynamics. revision: yes
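
Two of the promised additions are simple enough to sketch: restricting the input to one modality for the ablation, and an exact McNemar test on the trials where two models disagree. The channel names below are a hypothetical layout, since the paper's actual feature order is not given here. McNemar's test also treats trials as independent, which is only approximately true when many trials come from the same participant.

```python
from math import comb

# Hypothetical per-frame channel layout; the paper's real features may differ.
CHANNELS = ["head_yaw", "head_pitch", "head_roll", "eye_yaw", "eye_pitch"]


def select_modality(sequence, modality):
    """Keep only head, eye, or both channel groups from a list of frames."""
    prefixes = {"head": ("head_",), "eye": ("eye_",), "both": ("head_", "eye_")}[modality]
    keep = [i for i, name in enumerate(CHANNELS) if name.startswith(prefixes)]
    return [[frame[i] for i in keep] for frame in sequence]


def discordant_counts(y_true, pred_a, pred_b):
    """Count trials where exactly one of the two models is correct."""
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy usage with made-up predictions.
b, c = discordant_counts([0, 1, 2, 3], [0, 1, 2, 0], [0, 0, 0, 3])
p_value = mcnemar_exact(b, c)
```

Training the same model once per modality setting and reporting Macro F1 for each gives the head-only versus eye-only comparison the referee asks for.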

Circularity Check

0 steps flagged

No circularity: empirical pilot benchmark with direct measurements

Full rationale

The paper reports results from a controlled pilot study (N=4, 240 trials) using a compact time-series model (TinyHAR) on ARKit head/eye data. No mathematical derivation chain, equations, or first-principles predictions exist. Reported Macro F1 scores (0.960 gesture, 0.997 user ID) are direct empirical measurements on the collected data, not quantities fitted to a subset and then renamed as predictions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained as an empirical feasibility benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Performance claims depend on the trained weights of TinyHAR and the assumption that ARKit transforms capture sufficient signal for the gestures; no new physical entities or unstated mathematical axioms are introduced.

free parameters (1)
  • TinyHAR model weights
    46k parameters are learned from the 240-trial pilot dataset.
axioms (1)
  • domain assumption: ARKit head and eye transforms provide reliable input for gaze gesture recognition
    The entire pipeline is built on commodity ARKit data without additional validation of sensor accuracy for this task.
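
As a rough sense of scale for the ledger above, the figures quoted on this page can be combined directly. This is only back-of-the-envelope arithmetic that takes the rounded values (46k parameters, 240 trials, 4 participants) at face value and ignores any train/test split:

```latex
\frac{240\ \text{trials}}{4\ \text{participants}} = 60\ \text{trials per participant},
\qquad
\frac{46{,}000\ \text{parameters}}{240\ \text{trials}} \approx 192\ \text{parameters per trial}
```

The ratio does not by itself show overfitting, but it illustrates why the referee asks for cross-subject validation before the scores are read as transferable.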



Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Hirotaka Aoki, John Paulin Hansen, and Kenji Itoh. 2008. Learning to interact with a computer by gaze. Behaviour & Information Technology 27, 4 (2008), 339–344

  2. [2]

    Heiko Drewes and Albrecht Schmidt. 2007. Interacting with the computer using gaze gestures. In IFIP conference on human-computer interaction. Springer, 475–488

  3. [3]

    Carlos Elmadjian and Carlos H Morimoto. 2021. Gazebar: Exploiting the midas touch in gaze interaction. In Extended abstracts of the 2021 CHI conference on human factors in computing systems. 1–7

  4. [4]

    Kenko Fujii, Gauthier Gras, Antonino Salerno, and Guang-Zhong Yang. 2018. Gaze gesture based human robot interaction for laparoscopic surgery. Medical image analysis 44 (2018), 196–214

  5. [5]

    Shijing He, Yaxiong Lei, Zihan Zhang, Yuzhou Sun, Shujun Li, Chi Zhang, and Juan Ye. 2025. Identity Deepfake Threats to Biometric Authentication Systems: Public and Expert Perspectives. arXiv preprint arXiv:2506.06825 (2025)

  6. [6]

    Zhiming Hu, Daniel Haeufle, Syn Schmitt, and Andreas Bulling. 2025. Hoigaze: Gaze estimation during hand-object interactions in extended reality exploiting eye-hand-head coordination. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. 1–10

  7. [7]

    Robert JK Jacob. 1991. The use of eye movements in human-computer interaction techniques: what you look at is what you get. ACM Transactions on Information Systems (TOIS) 9, 2 (1991), 152–169

  8. [8]

    Christina Katsini, Yasmeen Abdrabou, George E Raptis, Mohamed Khamis, and Florian Alt. 2020. The role of eye gaze in security and privacy applications: Survey and future HCI research directions. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–21

  9. [9]

    Mohamed Khamis, Florian Alt, and Andreas Bulling. 2018. The past, present, and future of gaze-enabled handheld mobile devices: Survey and lessons learned. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services. 1–17

  10. [10]

    Andy Kong, Karan Ahuja, Mayank Goel, and Chris Harrison. 2021. Eyemu interactions: Gaze+ imu gestures on mobile devices. In Proceedings of the 2021 International Conference on Multimodal Interaction. 577–585

  11. [11]

    Yaxiong Lei, Xinya Gong, Shijing He, Yafei Wang, Mohamed Khamis, and Juan Ye. 2026. The People’s Gaze: Co-Designing and Refining Gaze Gestures with Users and Experts. In Proceedings of the 2026 CHI conference on human factors in computing systems

  12. [12]

    Yaxiong Lei, Shijing He, Huining Feng, Kaixing Zhao, Mohamed Khamis, and Juan Ye. 2023. Protecting Privacy in an Era of Pervasive Camera-Based Devices: Challenges and Potential Directions. In Proceedings of the Fifth UK Mobile, Wearable and Ubiquitous Systems Research Symposium

  13. [13]

    Yaxiong Lei, Shijing He, Mohamed Khamis, and Juan Ye. 2023. An end-to-end review of gaze estimation and its interactive applications on handheld mobile devices. Comput. Surveys 56, 2 (2023), 1–38

  14. [14]

    Yaxiong Lei, Yuheng Wang, Fergus Buchanan, Mingyue Zhao, Yusuke Sugano, Shijing He, Mohamed Khamis, and Juan Ye. 2025. Quantifying the impact of motion on 2d gaze estimation in real-world mobile interactions. arXiv preprint arXiv:2502.10570 (2025)

  15. [15]

    Yaxiong Lei, Yuheng Wang, Tyler Caslin, Alexander Wisowaty, Xu Zhu, Mohamed Khamis, and Juan Ye. 2023. DynamicRead: Exploring robust gaze interaction methods for reading on handheld mobile devices under dynamic conditions. Proceedings of the ACM on Human-Computer Interaction 7, ETRA (2023), 1–17

  16. [16]

    Yaxiong Lei, Mingyue Zhao, Yuheng Wang, Shijing He, Yusuke Sugano, Yafei Wang, Kaixing Zhao, Mohamed Khamis, and Juan Ye. 2025. MAC-Gaze: Motion-Aware Continual Calibration for Mobile Gaze Tracking. arXiv preprint arXiv:2505.22769 (2025)

  17. [17]

    Saif Mahmud, M Tanjid Hasan Tonmoy, Kishor Kumar Bhaumik, AKM Mahbubur Rahman, M Ashraful Amin, Mohammad Shoyaib, Muhammad Asif Hossain Khan, and Amin Ahsan Ali. 2020. Human activity recognition from wearable sensor data using self-attention. In ECAI 2020. IOS Press, 1332–1339

  18. [18]

    Pallavi Mohan, Wooi Boon Goh, Chi-Wing Fu, and Sai-Kit Yeung. 2018. DualGaze: Addressing the midas touch problem in gaze mediated VR interaction. In 2018 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct). IEEE, 79–84

  19. [19]

    Cristina Palmero, Javier Selva, Mohammad Ali Bagheri, and Sergio Escalera. 2018. Recurrent cnn for 3d gaze estimation using appearance and shape cues. arXiv preprint arXiv:1805.03064 (2018)

  20. [20]

    T Maxwell Parker, Shervin Badihian, Ahmed Hassoon, Ali S Saber Tehrani, Nathan Farrell, David E Newman-Toker, and Jorge Otero-Millan. 2022. Eye and head movement recordings using smartphones for telemedicine applications: measurements of accuracy and precision. Frontiers in Neurology 13 (2022), 789581

  21. [21]

    Henry L Roediger and Andrew C Butler. 2011. The critical role of retrieval practice in long-term retention. Trends in cognitive sciences 15, 1 (2011), 20–27

  22. [22]

    Lei Shi, Cosmin Copot, and Steve Vanlanduit. 2021. Gaze gesture recognition by graph convolutional networks. Frontiers in Robotics and AI 8 (2021)

  23. [23]

    Nachiappan Valliappan, Na Dai, Ethan Steinberg, Junfeng He, Kantwon Rogers, Venky Ramachandran, Pingmei Xu, Mina Shojaeizadeh, Li Guo, Kai Kohlhoff, et al. 2020. Accelerating eye movement research via accurate and affordable smartphone eye tracking. Nature communications 11, 1 (2020)

  24. [24]

    Janneke Van de Pol, Monique Volman, and Jos Beishuizen. 2010. Scaffolding in teacher–student interaction: A decade of research. Educational psychology review 22, 3 (2010), 271–296

  25. [25]

    Renzhuo Wan, Shuping Mei, Jun Wang, Min Liu, and Fan Yang. 2019. Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting. Electronics 8, 8 (2019), 876

  26. [26]

    Jacob O Wobbrock, Andrew D Wilson, and Yang Li. 2007. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In Proceedings of the 20th annual ACM symposium on User interface software and technology. 159–168. CHI EA ’26, April 13–17, 2026, Barcelona, Spain Yaxiong Lei et al

  27. [27]

    Shumin Zhai. 2003. What’s in the eyes for attentive input. Commun. ACM 46, 3 (2003), 34–39

  28. [28]

    Wenhao Zhang, Melvyn L Smith, Lyndon N Smith, and Abdul Farooq. 2016. Gender and gaze gesture recognition for human-computer interaction. Computer Vision and Image Understanding 149 (2016), 32–50

  29. [29]

    Yexu Zhou, Haibin Zhao, Yiran Huang, Till Riedel, Michael Hefenbrock, and Michael Beigl. 2022. Tinyhar: A lightweight deep learning model designed for human activity recognition. In Proceedings of the 2022 ACM International Symposium on Wearable Computers. 89–93