pith. machine review for the scientific record.

arxiv: 2604.02509 · v1 · submitted 2026-04-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Rapidly deploying on-device eye tracking by distilling visual foundation models

Cheng Jiang, Jogendra Kundu, David Colmenares, Fengting Yang, Joseph Robinson, Yatong An, Ali Behrooz

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 21:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords eye tracking · gaze estimation · visual foundation models · knowledge distillation · synthetic data · domain adaptation · on-device deployment · AR/VR

The pith

DistillGaze distills visual foundation models with synthetic labels and unlabeled real images to produce accurate 256K-parameter eye trackers deployable on new hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to rapidly adapt visual foundation models for on-device eye tracking when camera placements and lighting change across device generations. It uses a two-stage process: first turning a foundation model into a specialized teacher through self-supervised learning on labeled synthetic data plus unlabeled real data, then training a compact student model under the teacher's guidance plus self-training. This closes the synthetic-to-real gap enough to cut median gaze error by 58.62 percent relative to synthetic-only baselines. A reader would care because the result is a lightweight model that runs in real time and adapts without large new labeled datasets for each hardware variant.

Core claim

DistillGaze proceeds in two stages. First, a visual foundation model is adapted into a domain-specialized teacher using self-supervised learning on labeled synthetic images and unlabeled real images, where synthetic data supplies gaze supervision and real data bridges the domain gap. Second, a lightweight student model is trained using both teacher guidance and self-training. On a large-scale crowd-sourced dataset with over 2,000 participants, the resulting 256K-parameter model reduces median gaze error by 58.62 percent compared with synthetic-only baselines while remaining suitable for real-time on-device deployment across varying hardware configurations.
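To make the first stage concrete, here is a minimal PyTorch-style sketch of teacher adaptation as the abstract describes it: a supervised gaze loss on labeled synthetic images combined with a self-supervised consistency loss on unlabeled real images. The backbone wrapper, the particular consistency objective, and the loss weight are illustrative assumptions, not details taken from the paper.

```python
# Stage 1 (sketch): adapt a pretrained visual foundation model into a gaze
# teacher. The backbone wrapper, the augmentation-consistency objective, and
# the loss weight `lam` are illustrative assumptions, not the paper's choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeTeacher(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone              # pretrained VFM encoder (e.g. a ViT)
        self.head = nn.Linear(feat_dim, 3)    # regress a 3D gaze direction

    def forward(self, x):
        feat = self.backbone(x)               # [B, feat_dim]
        gaze = F.normalize(self.head(feat), dim=-1)
        return gaze, feat

def stage1_step(teacher, optimizer, syn_img, syn_gaze, real_img, real_img_aug, lam=0.5):
    """Supervised gaze loss on labeled synthetic images plus a self-supervised
    consistency loss between two augmented views of unlabeled real images."""
    pred_syn, _ = teacher(syn_img)
    sup_loss = (1.0 - F.cosine_similarity(pred_syn, syn_gaze, dim=-1)).mean()

    _, feat_a = teacher(real_img)
    _, feat_b = teacher(real_img_aug)
    ssl_loss = (1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1)).mean()

    loss = sup_loss + lam * ssl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```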

What carries the argument

DistillGaze, a two-stage distillation framework in which a visual foundation model is first adapted via self-supervised learning on mixed synthetic and real data to create a teacher, which then supervises a compact student model for gaze regression.
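A corresponding sketch of the second stage, again under stated assumptions: a tiny student regressor is trained on labeled synthetic images while the frozen Stage-1 teacher supplies pseudo gaze labels on unlabeled real images, which is one plausible reading of "teacher guidance plus self-training". The architecture and loss weights below are placeholders standing in for the paper's 256K-parameter on-device model, not its actual design.

```python
# Stage 2 (sketch): train a compact student under teacher guidance plus
# self-training. The tiny CNN, single-channel input, and loss weight are
# placeholders, not the paper's 256K-parameter architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)

    def forward(self, x):
        return F.normalize(self.head(self.features(x)), dim=-1)

def stage2_step(student, teacher, optimizer, syn_img, syn_gaze, real_img, w_self=0.5):
    """`teacher` is the frozen Stage-1 model returning (gaze, features)."""
    with torch.no_grad():
        pseudo_gaze, _ = teacher(real_img)    # teacher pseudo-labels on real images

    sup_loss = (1.0 - F.cosine_similarity(student(syn_img), syn_gaze, dim=-1)).mean()
    self_loss = (1.0 - F.cosine_similarity(student(real_img), pseudo_gaze, dim=-1)).mean()

    loss = sup_loss + w_self * self_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```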

If this is right

  • Eye tracking models can be trained and deployed for successive AR/VR device generations without collecting large new labeled real datasets each time.
  • A 256K-parameter model supports real-time on-device inference while achieving substantially lower error than larger synthetic-only alternatives.
  • The same two-stage process handles changes in camera pose, placement, and illumination without retraining from scratch.
  • Synthetic data supplies scalable supervision while unlabeled real data supplies the domain adaptation signal needed for regression tasks (see the data-mixing sketch after this list).
  • Foundation models originally trained on natural images can be specialized for near-eye infrared imagery through this distillation recipe.
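As flagged in the list above, here is a minimal sketch of how the two supervision sources could be combined at the batch level, assuming synthetic samples carry gaze labels and real samples do not; the dataset objects and batch sizes are hypothetical, and only the DataLoader API is standard PyTorch.

```python
# Sketch: interleave labeled synthetic and unlabeled real batches so that each
# training step sees both supervision sources. Dataset objects and batch sizes
# are hypothetical placeholders.
from torch.utils.data import DataLoader

def mixed_batches(synthetic_ds, real_ds, syn_bs=64, real_bs=64):
    syn_loader = DataLoader(synthetic_ds, batch_size=syn_bs, shuffle=True, drop_last=True)
    real_loader = DataLoader(real_ds, batch_size=real_bs, shuffle=True, drop_last=True)
    # Synthetic batches yield (image, gaze) pairs; real batches yield images only.
    for (syn_img, syn_gaze), real_img in zip(syn_loader, real_loader):
        yield syn_img, syn_gaze, real_img
```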

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to other on-device regression problems such as hand tracking or facial landmark detection that also face synthetic-to-real gaps.
  • Further gains might come from testing whether different foundation model backbones yield better teachers for the same student size.
  • The method implies that improvements in the quality or diversity of synthetic eye images would directly raise the final accuracy ceiling.
  • Deployment on additional hardware variants with measured error reduction would confirm the claimed adaptability across device families.

Load-bearing premise

That self-supervised adaptation of a visual foundation model on labeled synthetic data plus unlabeled real data will close the synthetic-to-real domain gap enough for high-accuracy gaze estimation across new hardware configurations.

What would settle it

A new hardware setup with different camera placement or illumination on which the distilled model still delivers a substantial reduction in median gaze error over a synthetic-only baseline.
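In code, settling it would come down to comparing median angular gaze errors on that new hardware split, along the lines of the sketch below; the exact error definition behind the paper's 58.62% figure is an assumption here, not quoted from the text.

```python
# Sketch: median angular gaze error and its relative reduction versus a
# synthetic-only baseline. The exact metric definition is assumed.
import numpy as np

def angular_error_deg(pred, gt):
    """Per-sample angle (degrees) between predicted and ground-truth 3D gaze vectors."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def median_error_reduction(err_method, err_baseline):
    med_m, med_b = np.median(err_method), np.median(err_baseline)
    return 100.0 * (med_b - med_m) / med_b   # 58.62 would mean a 58.62% reduction
```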

read the original abstract

Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DistillGaze, a two-stage distillation framework that first adapts a visual foundation model into a domain-specialized teacher via self-supervised learning on labeled synthetic and unlabeled real near-eye infrared images, then trains a lightweight student model using teacher guidance and self-training. The central empirical claim is that this yields a 58.62% reduction in median gaze error relative to synthetic-only baselines on a crowd-sourced dataset spanning over 2,000 participants, while producing a 256K-parameter model suitable for real-time on-device deployment and adaptation to hardware changes.

Significance. If the performance gains and cross-hardware generalization hold under more rigorous validation, the work would provide a practical recipe for rapid on-device eye-tracking deployment in AR/VR by efficiently bridging synthetic-to-real gaps without large-scale labeled real data. The lightweight model size and emphasis on unlabeled real data for adaptation are clear strengths for on-device regression tasks.

major comments (2)
  1. [Evaluation] The experimental evaluation reports a 58.62% median error reduction but provides no error bars, ablation details on the contribution of each stage, or statistical tests, leaving the robustness of the central performance claim unassessable from the given results.
  2. [Evaluation] No explicit held-out hardware split, device-type ablation, or cross-device generalization test is described; the single crowd-sourced dataset evaluation therefore does not isolate whether the domain-gap closure works for truly novel camera geometries, poses, or illumination across device generations, which is load-bearing for the adaptation claim.
minor comments (1)
  1. The abstract and method description would benefit from explicitly naming the synthetic-only baselines and the precise self-supervised objectives used in each stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our evaluation that can be strengthened. We will revise the manuscript to address the concerns regarding robustness and generalization while maintaining the core contributions of DistillGaze.

read point-by-point responses
  1. Referee: The experimental evaluation reports a 58.62% median error reduction but provides no error bars, ablation details on the contribution of each stage, or statistical tests, leaving the robustness of the central performance claim unassessable from the given results.

    Authors: We agree that these elements are necessary for assessing robustness. In the revised manuscript, we will add error bars (computed as standard deviation over multiple training runs with different random seeds), detailed ablations breaking down the contribution of the self-supervised teacher adaptation stage versus the student self-training stage, and statistical significance tests (e.g., Wilcoxon signed-rank test) comparing DistillGaze against the synthetic-only baseline. These additions will directly support the reported 58.62% median error reduction. revision: yes

  2. Referee: No explicit held-out hardware split, device-type ablation, or cross-device generalization test is described; the single crowd-sourced dataset evaluation therefore does not isolate whether the domain-gap closure works for truly novel camera geometries, poses, or illumination across device generations, which is load-bearing for the adaptation claim.

    Authors: We acknowledge that an explicit held-out hardware split would more rigorously isolate cross-device generalization. Our crowd-sourced dataset inherently includes variations in camera geometries, poses, and illumination across 2,000+ participants, but we did not perform device-type partitioning. In revision, we will add a device-type ablation by grouping samples based on available participant metadata (e.g., inferred device characteristics) and report performance on held-out subsets. We will also expand the discussion to clarify how the two-stage framework enables adaptation to new hardware via unlabeled real data, while noting any limitations due to metadata availability as a direction for future datasets. revision: partial
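For concreteness, the statistical checks promised in these responses might look like the following sketch: a Wilcoxon signed-rank test on paired per-participant median errors and a mean ± std error bar over training seeds. The scipy and numpy calls are real APIs; the data layout is an assumption about how the results would be organized.

```python
# Sketch of the statistical checks the rebuttal promises. Inputs are assumed
# to be per-participant median errors (paired) and per-seed overall medians.
import numpy as np
from scipy.stats import wilcoxon

def paired_significance(err_distillgaze, err_baseline):
    """err_* are per-participant median gaze errors, paired by participant."""
    stat, p_value = wilcoxon(err_distillgaze, err_baseline)
    return stat, p_value

def seed_error_bar(per_seed_medians):
    """Median gaze error from several training runs with different random seeds."""
    arr = np.asarray(per_seed_medians, dtype=float)
    return arr.mean(), arr.std(ddof=1)
```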

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external dataset

full rationale

The paper describes a two-stage distillation process: self-supervised adaptation of a VFM on labeled synthetic plus unlabeled real data, followed by training a lightweight student model. The central performance claim (58.62% median error reduction) is evaluated against synthetic-only baselines on a large external crowd-sourced dataset (>2000 participants). No equations, fitted parameters, or self-citations are shown to reduce this gain to a quantity defined by the inputs themselves. The method is presented as a practical recipe rather than a closed-form derivation, and the reported improvement is falsifiable via the held-out test set. This is the most common honest outcome for an empirical distillation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard machine-learning assumptions about the transferability of self-supervised adaptation and distillation; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

axioms (1)
  • domain assumption: Self-supervised learning on labeled synthetic and unlabeled real near-eye images can produce a domain-specialized teacher that generalizes to new hardware.
    Invoked in the first stage of DistillGaze as described in the abstract.

pith-pipeline@v0.9.0 · 5599 in / 1330 out tokens · 46867 ms · 2026-05-13T21:35:24.415807+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

  1. [1]

Digitally prototype your eye tracker: Simulating hardware performance using 3d synthetic data. arXiv preprint arXiv:2503.16742, 2025

Esther YH Lin, Yimin Ding, Jogendra Kundu, Yatong An, Mohamed T El-Haddad, and Alexander Fix. Digitally prototype your eye tracker: Simulating hardware performance using 3d synthetic data. arXiv preprint arXiv:2503.16742, 2025

  2. [2]

Enabling eye tracking for crowd-sourced data collection with Project Aria. IEEE Access, 2025

Yusuf Mansour, Ajoy Savio Fernandes, Kiran Somasundaram, Tarek Hefny, Mahsa Shakeri, Oleg Komogortsev, Abhishek Sharma, and Michael J Proulx. Enabling eye tracking for crowd-sourced data collection with Project Aria. IEEE Access, 2025

  3. [3]

    DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  4. [4]

    SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  5. [5]

    Towards a general-purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

    Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

  6. [6]

    Foundation models for fast, label-free detection of glioma infiltration.Nature, 637(8045):439–445, 2025

    Akhil Kondepudi, Melike Pekmezci, Xinhai Hou, Katie Scotford, Cheng Jiang, Akshay Rao, Edward S Harake, Asadur Chowdury, Wajd Al-Holou, Lin Wang, et al. Foundation models for fast, label-free detection of glioma infiltration.Nature, 637(8045):439–445, 2025

  7. [7]

    Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

  8. [8]

    Spectralgpt: Spectral remote sensing foundation model.IEEE transactions on pattern analysis and machine intelligence, 46(8):5227–5244, 2024

    Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Pedram Ghamisi, Xiuping Jia, et al. Spectralgpt: Spectral remote sensing foundation model.IEEE transactions on pattern analysis and machine intelligence, 46(8):5227–5244, 2024

  9. [9]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

  10. [10]

General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6):1124–1133, 2006

Elias Daniel Guestrin and Moshe Eizenman. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6):1124–1133, 2006

  11. [11]

    Eye gaze tracking under natural head movements

    Zhiwei Zhu and Qiang Ji. Eye gaze tracking under natural head movements. In2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 918–923. IEEE, 2005

  12. [12]

    3d gaze estimation for head-mounted eye tracking system with auto-calibration method.IEEE Access, 8:104207–104215, 2020

    Meng Liu, Youfu Li, and Hai Liu. 3d gaze estimation for head-mounted eye tracking system with auto-calibration method.IEEE Access, 8:104207–104215, 2020

  13. [13]

    Appearance-based gaze estimation in the wild

    Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4511–4520, 2015

  14. [14]

    Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation

    Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, and Otmar Hilliges. Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. InEuropean Conference on Computer Vision, pages 365–381, 2020

  15. [15]

    Gaze estimation using transformer

    Yihua Cheng and Feng Lu. Gaze estimation using transformer. InInternational Conference on Pattern Recognition (ICPR), pages 3341–3347. IEEE, 2022

  16. [16]

    Puregaze: Purifying gaze feature for generalizable gaze estimation

    Yihua Cheng, Yiwei Bao, and Feng Lu. Puregaze: Purifying gaze feature for generalizable gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 436–443, 2022

  17. [17]

    Gaze360: Physically unconstrained gaze estimation in the wild

    Petr Kellnhofer, Adria Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. InProceedings of the IEEE/CVF international conference on computer vision, pages 6912–6921, 2019

  18. [18]

    Nvgaze: An anatomically-informed dataset for low-latency, near-eye gaze estimation

    Joohwan Kim, Michael Stengel, Alexander Majercik, Shalini De Mello, David Dunn, Samuli Laine, Morgan McGuire, and David Luebke. Nvgaze: An anatomically-informed dataset for low-latency, near-eye gaze estimation. InProceedings of the 2019 CHI conference on human factors in computing systems, pages 1–12, 2019

  19. [19]

    Openeds2020: Open eyes dataset.arXiv preprint arXiv:2005.03876, 2020

    Cristina Palmero, Abhishek Sharma, Karsten Behrendt, Kapil Krishnakumar, Oleg V Komogortsev, and Sachin S Talathi. Openeds2020: Open eyes dataset.arXiv preprint arXiv:2005.03876, 2020

  20. [20]

    Teyed: Over 20 million real-world eye images with pupil, eyelid, and iris 2d and 3d segmentations, 2d and 3d landmarks, 3d eyeball, gaze vector, and eye movement types

    Wolfgang Fuhl, Gjergji Kasneci, and Enkelejda Kasneci. Teyed: Over 20 million real-world eye images with pupil, eyelid, and iris 2d and 3d segmentations, 2d and 3d landmarks, 3d eyeball, gaze vector, and eye movement types. arXiv preprint arXiv:2102.02115, 2021

  21. [21]

De^2Gaze: Deformable and decoupled representation learning for 3d gaze estimation

Yunfeng Xiao, Xiaowei Bai, Baojun Chen, Hao Su, Hao He, Liang Xie, and Erwei Yin. De^2Gaze: Deformable and decoupled representation learning for 3d gaze estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3091–3100, 2025

  22. [22]

    U2eyes: A binocular dataset for eye tracking and gaze estimation

    Sonia Porta, Benoit Bossavit, Rafael Cabeza, Andoni Larumbe-Bergera, Gonzalo Garde, and Arantxa Villanueva. U2eyes: A binocular dataset for eye tracking and gaze estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019

  23. [23]

    Learning an appearance-based gaze estimator from one million synthesised images

    Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Learning an appearance-based gaze estimator from one million synthesised images. InProceedings of the ninth biennial ACM symposium on eye tracking research & applications, pages 131–138, 2016

  24. [24]

    Learning from simulated and unsupervised images through adversarial training

    Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2107–2116, 2017

  25. [25]

    Deep domain adaptation: A sim2real neural approach for improving eye-tracking systems.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 7(2):1–17, 2024

    Viet Dung Nguyen, Reynold Bailey, Gabriel J Diaz, Chengyi Ma, Alexander Fix, and Alexander Ororbia. Deep domain adaptation: A sim2real neural approach for improving eye-tracking systems.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 7(2):1–17, 2024

  26. [26]

    Rendering of eyes for eye-shape registration and gaze estimation

    Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. Rendering of eyes for eye-shape registration and gaze estimation. InProceedings of the IEEE International Conference on Computer Vision, pages 3756–3764, 2015

  27. [27]

    Eyenerf: a hybrid representation for photorealistic synthesis, animation and relighting of human eyes.ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

    Gengyan Li, Abhimitra Meka, Franziska Mueller, Marcel C Buehler, Otmar Hilliges, and Thabo Beeler. Eyenerf: a hybrid representation for photorealistic synthesis, animation and relighting of human eyes.ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

  28. [28]

    Self-supervised domain adaptation for computer vision tasks

    Jiaolong Xu, Liang Xiao, and Antonio M López. Self-supervised domain adaptation for computer vision tasks. IEEE Access, 7:156694–156706, 2019

  29. [29]

Improving out-of-distribution generalization via multi-task self-supervised pretraining. arXiv preprint arXiv:2003.13525, 2020

Isabela Albuquerque, Nikhil Naik, Junnan Li, Nitish Keskar, and Richard Socher. Improving out-of-distribution generalization via multi-task self-supervised pretraining. arXiv preprint arXiv:2003.13525, 2020

  30. [30]

    A cookbook of self-supervised learning.arXiv preprint arXiv:2304.12210, 2023

    Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning.arXiv preprint arXiv:2304.12210, 2023

  31. [31]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

  32. [32]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  33. [33]

    Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

  34. [34]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  36. [36]

Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021

Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021

  37. [37]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  38. [38]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers.arXiv preprint arXiv:2106.08254, 2021

  39. [39]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer.arXiv preprint arXiv:2111.07832, 2021

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  41. [41]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

  42. [42]

    Fitnets: Hints for thin deep nets, 2015

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets, 2015

  43. [43]

    Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.arXiv preprint arXiv:1612.03928, 2016

  44. [44]

    Similarity-preserving knowledge distillation

    Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. InProceedings of the IEEE/CVF international conference on computer vision, pages 1365–1374, 2019

  45. [45]

    Contrastive Representation Distillation

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive Representation Distillation. InInternational Conference on Learning Representations, 2020

  46. [46]

    Co-training and co-distillation for quality improvement and compression of language models

    Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Hwang, and Alexander Min. Co-training and co-distillation for quality improvement and compression of language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 7458–7467, 2023

  47. [47]

    Vic-kd: Variance-invariance-covariance knowledge distillation to make keyword spotting more robust against adversarial attacks

    Heitor R Guimarães, Arthur Pimentel, Anderson Avila, and Tiago H Falk. Vic-kd: Variance-invariance-covariance knowledge distillation to make keyword spotting more robust against adversarial attacks. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12196–12200. IEEE, 2024

  48. [48]

Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10734–10742, 2019

  49. [49]

    Evaluation of eye tracking signal quality for virtual reality applications: A case study in the meta quest pro

    Samantha Aziz, Dillon J Lohr, Lee Friedman, and Oleg Komogortsev. Evaluation of eye tracking signal quality for virtual reality applications: A case study in the meta quest pro. InProceedings of the 2024 Symposium on Eye Tracking Research and Applications, pages 1–8, 2024

  50. [50]

    Dare-gram: Unsupervised domain adaptation regression by aligning inverse gram matrices

Ismail Nejjar, Qin Wang, and Olga Fink. Dare-gram: Unsupervised domain adaptation regression by aligning inverse gram matrices. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11744–11754, 2023