pith. machine review for the scientific record.

arxiv: 2604.02509 · v1 · submitted 2026-04-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Rapidly deploying on-device eye tracking by distilling visual foundation models

Cheng Jiang, Jogendra Kundu, David Colmenares, Fengting Yang, Joseph Robinson, Yatong An, Ali Behrooz

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 21:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords eye tracking · gaze estimation · visual foundation models · knowledge distillation · synthetic data · domain adaptation · on-device deployment · AR/VR

The pith

DistillGaze distills visual foundation models with synthetic labels and unlabeled real images to produce accurate 256K-parameter eye trackers deployable on new hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to rapidly adapt visual foundation models for on-device eye tracking when camera placements and lighting change across device generations. It uses a two-stage process: first turning a foundation model into a specialized teacher through self-supervised learning on labeled synthetic data plus unlabeled real data, then training a compact student model under the teacher's guidance plus self-training. This closes the synthetic-to-real gap enough to cut median gaze error by 58.62 percent relative to synthetic-only baselines. A reader would care because the result is a lightweight model that runs in real time and adapts without large new labeled datasets for each hardware variant.

Core claim

DistillGaze proceeds in two stages. First, a visual foundation model is adapted into a domain-specialized teacher using self-supervised learning on labeled synthetic images and unlabeled real images, where synthetic data supplies gaze supervision and real data bridges the domain gap. Second, a lightweight student model is trained using both teacher guidance and self-training. On a large-scale crowd-sourced dataset with over 2,000 participants, the resulting 256K-parameter model reduces median gaze error by 58.62 percent compared with synthetic-only baselines while remaining suitable for real-time on-device deployment across varying hardware configurations.
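To make the first stage concrete, here is a minimal PyTorch-style sketch of teacher adaptation as the abstract describes it: a supervised gaze loss on labeled synthetic images combined with a self-supervised consistency loss on unlabeled real images. The backbone wrapper, the particular consistency objective, and the loss weight are illustrative assumptions, not details taken from the paper.

```python
# Stage 1 (sketch): adapt a pretrained visual foundation model into a gaze
# teacher. The backbone wrapper, the augmentation-consistency objective, and
# the loss weight `lam` are illustrative assumptions, not the paper's choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeTeacher(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone              # pretrained VFM encoder (e.g. a ViT)
        self.head = nn.Linear(feat_dim, 3)    # regress a 3D gaze direction

    def forward(self, x):
        feat = self.backbone(x)               # [B, feat_dim]
        gaze = F.normalize(self.head(feat), dim=-1)
        return gaze, feat

def stage1_step(teacher, optimizer, syn_img, syn_gaze, real_img, real_img_aug, lam=0.5):
    """Supervised gaze loss on labeled synthetic images plus a self-supervised
    consistency loss between two augmented views of unlabeled real images."""
    pred_syn, _ = teacher(syn_img)
    sup_loss = (1.0 - F.cosine_similarity(pred_syn, syn_gaze, dim=-1)).mean()

    _, feat_a = teacher(real_img)
    _, feat_b = teacher(real_img_aug)
    ssl_loss = (1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1)).mean()

    loss = sup_loss + lam * ssl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```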

What carries the argument

DistillGaze, a two-stage distillation framework in which a visual foundation model is first adapted via self-supervised learning on mixed synthetic and real data to create a teacher, which then supervises a compact student model for gaze regression.
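A corresponding sketch of the second stage, again under stated assumptions: a tiny student regressor is trained on labeled synthetic images while the frozen Stage-1 teacher supplies pseudo gaze labels on unlabeled real images, which is one plausible reading of "teacher guidance plus self-training". The architecture and loss weights below are placeholders standing in for the paper's 256K-parameter on-device model, not its actual design.

```python
# Stage 2 (sketch): train a compact student under teacher guidance plus
# self-training. The tiny CNN, single-channel input, and loss weight are
# placeholders, not the paper's 256K-parameter architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)

    def forward(self, x):
        return F.normalize(self.head(self.features(x)), dim=-1)

def stage2_step(student, teacher, optimizer, syn_img, syn_gaze, real_img, w_self=0.5):
    """`teacher` is the frozen Stage-1 model returning (gaze, features)."""
    with torch.no_grad():
        pseudo_gaze, _ = teacher(real_img)    # teacher pseudo-labels on real images

    sup_loss = (1.0 - F.cosine_similarity(student(syn_img), syn_gaze, dim=-1)).mean()
    self_loss = (1.0 - F.cosine_similarity(student(real_img), pseudo_gaze, dim=-1)).mean()

    loss = sup_loss + w_self * self_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```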

If this is right

  • Eye tracking models can be trained and deployed for successive AR/VR device generations without collecting large new labeled real datasets each time.
  • A 256K-parameter model supports real-time on-device inference while achieving substantially lower error than larger synthetic-only alternatives.
  • The same two-stage process handles changes in camera pose, placement, and illumination without retraining from scratch.
  • Synthetic data supplies scalable supervision while unlabeled real data supplies the domain adaptation signal needed for regression tasks (see the data-mixing sketch after this list).
  • Foundation models originally trained on natural images can be specialized for near-eye infrared imagery through this distillation recipe.
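As flagged in the list above, here is a minimal sketch of how the two supervision sources could be combined at the batch level, assuming synthetic samples carry gaze labels and real samples do not; the dataset objects and batch sizes are hypothetical, and only the DataLoader API is standard PyTorch.

```python
# Sketch: interleave labeled synthetic and unlabeled real batches so that each
# training step sees both supervision sources. Dataset objects and batch sizes
# are hypothetical placeholders.
from torch.utils.data import DataLoader

def mixed_batches(synthetic_ds, real_ds, syn_bs=64, real_bs=64):
    syn_loader = DataLoader(synthetic_ds, batch_size=syn_bs, shuffle=True, drop_last=True)
    real_loader = DataLoader(real_ds, batch_size=real_bs, shuffle=True, drop_last=True)
    # Synthetic batches yield (image, gaze) pairs; real batches yield images only.
    for (syn_img, syn_gaze), real_img in zip(syn_loader, real_loader):
        yield syn_img, syn_gaze, real_img
```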

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to other on-device regression problems such as hand tracking or facial landmark detection that also face synthetic-to-real gaps.
  • Further gains might come from testing whether different foundation model backbones yield better teachers for the same student size.
  • The method implies that improvements in the quality or diversity of synthetic eye images would directly raise the final accuracy ceiling.
  • Deployment on additional hardware variants with measured error reduction would confirm the claimed adaptability across device families.

Load-bearing premise

That self-supervised adaptation of a visual foundation model on labeled synthetic data plus unlabeled real data will close the synthetic-to-real domain gap enough for high-accuracy gaze estimation across new hardware configurations.

What would settle it

A new hardware setup with different camera placement or illumination on which the distilled model still delivers a substantial reduction in median gaze error over a synthetic-only baseline.
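In code, settling it would come down to comparing median angular gaze errors on that new hardware split, along the lines of the sketch below; the exact error definition behind the paper's 58.62% figure is an assumption here, not quoted from the text.

```python
# Sketch: median angular gaze error and its relative reduction versus a
# synthetic-only baseline. The exact metric definition is assumed.
import numpy as np

def angular_error_deg(pred, gt):
    """Per-sample angle (degrees) between predicted and ground-truth 3D gaze vectors."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def median_error_reduction(err_method, err_baseline):
    med_m, med_b = np.median(err_method), np.median(err_baseline)
    return 100.0 * (med_b - med_m) / med_b   # 58.62 would mean a 58.62% reduction
```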

read the original abstract

Eye tracking (ET) plays a critical role in augmented and virtual reality applications. However, rapidly deploying high-accuracy, on-device gaze estimation for new products remains challenging because hardware configurations (e.g., camera placement, camera pose, and illumination) often change across device generations. Visual foundation models (VFMs) are a promising direction for rapid training and deployment, and they excel on natural-image benchmarks; yet we find that off-the-shelf VFMs still struggle to achieve high accuracy on specialized near-eye infrared imagery. To address this gap, we introduce DistillGaze, a framework that distills a foundation model by leveraging labeled synthetic data and unlabeled real data for rapid and high-performance on-device gaze estimation. DistillGaze proceeds in two stages. First, we adapt a VFM into a domain-specialized teacher using self-supervised learning on labeled synthetic and unlabeled real images. Synthetic data provides scalable, high-quality gaze supervision, while unlabeled real data helps bridge the synthetic-to-real domain gap. Second, we train an on-device student using both teacher guidance and self-training. Evaluated on a large-scale, crowd-sourced dataset spanning over 2,000 participants, DistillGaze reduces median gaze error by 58.62% relative to synthetic-only baselines while maintaining a lightweight 256K-parameter model suitable for real-time on-device deployment. Overall, DistillGaze provides an efficient pathway for training and deploying ET models that adapt to hardware changes, and offers a recipe for combining synthetic supervision with unlabeled real data in on-device regression tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DistillGaze, a two-stage distillation framework that first adapts a visual foundation model into a domain-specialized teacher via self-supervised learning on labeled synthetic and unlabeled real near-eye infrared images, then trains a lightweight student model using teacher guidance and self-training. The central empirical claim is that this yields a 58.62% reduction in median gaze error relative to synthetic-only baselines on a crowd-sourced dataset spanning over 2,000 participants, while producing a 256K-parameter model suitable for real-time on-device deployment and adaptation to hardware changes.

Significance. If the performance gains and cross-hardware generalization hold under more rigorous validation, the work would provide a practical recipe for rapid on-device eye-tracking deployment in AR/VR by efficiently bridging synthetic-to-real gaps without large-scale labeled real data. The lightweight model size and emphasis on unlabeled real data for adaptation are clear strengths for on-device regression tasks.

major comments (2)
  1. [Evaluation] The experimental evaluation reports a 58.62% median error reduction but provides no error bars, ablation details on the contribution of each stage, or statistical tests, leaving the robustness of the central performance claim unassessable from the given results.
  2. [Evaluation] No explicit held-out hardware split, device-type ablation, or cross-device generalization test is described; the single crowd-sourced dataset evaluation therefore does not isolate whether the domain-gap closure works for truly novel camera geometries, poses, or illumination across device generations, which is load-bearing for the adaptation claim.
minor comments (1)
  1. The abstract and method description would benefit from explicitly naming the synthetic-only baselines and the precise self-supervised objectives used in each stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our evaluation that can be strengthened. We will revise the manuscript to address the concerns regarding robustness and generalization while maintaining the core contributions of DistillGaze.

read point-by-point responses
  1. Referee: The experimental evaluation reports a 58.62% median error reduction but provides no error bars, ablation details on the contribution of each stage, or statistical tests, leaving the robustness of the central performance claim unassessable from the given results.

    Authors: We agree that these elements are necessary for assessing robustness. In the revised manuscript, we will add error bars (computed as standard deviation over multiple training runs with different random seeds), detailed ablations breaking down the contribution of the self-supervised teacher adaptation stage versus the student self-training stage, and statistical significance tests (e.g., Wilcoxon signed-rank test) comparing DistillGaze against the synthetic-only baseline. These additions will directly support the reported 58.62% median error reduction. revision: yes

  2. Referee: No explicit held-out hardware split, device-type ablation, or cross-device generalization test is described; the single crowd-sourced dataset evaluation therefore does not isolate whether the domain-gap closure works for truly novel camera geometries, poses, or illumination across device generations, which is load-bearing for the adaptation claim.

    Authors: We acknowledge that an explicit held-out hardware split would more rigorously isolate cross-device generalization. Our crowd-sourced dataset inherently includes variations in camera geometries, poses, and illumination across 2,000+ participants, but we did not perform device-type partitioning. In revision, we will add a device-type ablation by grouping samples based on available participant metadata (e.g., inferred device characteristics) and report performance on held-out subsets. We will also expand the discussion to clarify how the two-stage framework enables adaptation to new hardware via unlabeled real data, while noting any limitations due to metadata availability as a direction for future datasets. revision: partial
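For concreteness, the statistical checks promised in these responses might look like the following sketch: a Wilcoxon signed-rank test on paired per-participant median errors and a mean ± std error bar over training seeds. The scipy and numpy calls are real APIs; the data layout is an assumption about how the results would be organized.

```python
# Sketch of the statistical checks the rebuttal promises. Inputs are assumed
# to be per-participant median errors (paired) and per-seed overall medians.
import numpy as np
from scipy.stats import wilcoxon

def paired_significance(err_distillgaze, err_baseline):
    """err_* are per-participant median gaze errors, paired by participant."""
    stat, p_value = wilcoxon(err_distillgaze, err_baseline)
    return stat, p_value

def seed_error_bar(per_seed_medians):
    """Median gaze error from several training runs with different random seeds."""
    arr = np.asarray(per_seed_medians, dtype=float)
    return arr.mean(), arr.std(ddof=1)
```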

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external dataset

full rationale

The paper describes a two-stage distillation process: self-supervised adaptation of a VFM on labeled synthetic plus unlabeled real data, followed by training a lightweight student model. The central performance claim (58.62% median error reduction) is evaluated against synthetic-only baselines on a large external crowd-sourced dataset (>2000 participants). No equations, fitted parameters, or self-citations are shown to reduce this gain to a quantity defined by the inputs themselves. The method is presented as a practical recipe rather than a closed-form derivation, and the reported improvement is falsifiable via the held-out test set. This is the most common honest outcome for an empirical distillation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard machine-learning assumptions about the transferability of self-supervised adaptation and distillation; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

axioms (1)
  • domain assumption: Self-supervised learning on labeled synthetic and unlabeled real near-eye images can produce a domain-specialized teacher that generalizes to new hardware.
    Invoked in the first stage of DistillGaze as described in the abstract.

pith-pipeline@v0.9.0 · 5599 in / 1330 out tokens · 46867 ms · 2026-05-13T21:35:24.415807+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

  1. [1]

Digitally prototype your eye tracker: Simulating hardware performance using 3d synthetic data. arXiv preprint arXiv:2503.16742, 2025

Esther YH Lin, Yimin Ding, Jogendra Kundu, Yatong An, Mohamed T El-Haddad, and Alexander Fix. Digitally prototype your eye tracker: Simulating hardware performance using 3d synthetic data. arXiv preprint arXiv:2503.16742, 2025

  2. [2]

Enabling eye tracking for crowd-sourced data collection with Project Aria. IEEE Access, 2025

Yusuf Mansour, Ajoy Savio Fernandes, Kiran Somasundaram, Tarek Hefny, Mahsa Shakeri, Oleg Komogortsev, Abhishek Sharma, and Michael J Proulx. Enabling eye tracking for crowd-sourced data collection with Project Aria. IEEE Access, 2025

  3. [3]

    DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  4. [4]

    SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

  5. [5]

    Towards a general-purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

    Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

  6. [6]

    Foundation models for fast, label-free detection of glioma infiltration.Nature, 637(8045):439–445, 2025

    Akhil Kondepudi, Melike Pekmezci, Xinhai Hou, Katie Scotford, Cheng Jiang, Akshay Rao, Edward S Harake, Asadur Chowdury, Wajd Al-Holou, Lin Wang, et al. Foundation models for fast, label-free detection of glioma infiltration.Nature, 637(8045):439–445, 2025

  7. [7]

    Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

  8. [8]

    Spectralgpt: Spectral remote sensing foundation model.IEEE transactions on pattern analysis and machine intelligence, 46(8):5227–5244, 2024

    Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Pedram Ghamisi, Xiuping Jia, et al. Spectralgpt: Spectral remote sensing foundation model.IEEE transactions on pattern analysis and machine intelligence, 46(8):5227–5244, 2024

  9. [9]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

  10. [10]

General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6):1124–1133, 2006

Elias Daniel Guestrin and Moshe Eizenman. General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6):1124–1133, 2006

  11. [11]

    Eye gaze tracking under natural head movements

    Zhiwei Zhu and Qiang Ji. Eye gaze tracking under natural head movements. In2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 918–923. IEEE, 2005

  12. [12]

    3d gaze estimation for head-mounted eye tracking system with auto-calibration method.IEEE Access, 8:104207–104215, 2020

    Meng Liu, Youfu Li, and Hai Liu. 3d gaze estimation for head-mounted eye tracking system with auto-calibration method.IEEE Access, 8:104207–104215, 2020

  13. [13]

    Appearance-based gaze estimation in the wild

    Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. Appearance-based gaze estimation in the wild. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4511–4520, 2015

  14. [14]

    Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation

    Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, and Otmar Hilliges. Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. InEuropean Conference on Computer Vision, pages 365–381, 2020

  15. [15]

    Gaze estimation using transformer

    Yihua Cheng and Feng Lu. Gaze estimation using transformer. InInternational Conference on Pattern Recognition (ICPR), pages 3341–3347. IEEE, 2022

  16. [16]

    Puregaze: Purifying gaze feature for generalizable gaze estimation

    Yihua Cheng, Yiwei Bao, and Feng Lu. Puregaze: Purifying gaze feature for generalizable gaze estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 436–443, 2022

  17. [17]

    Gaze360: Physically unconstrained gaze estimation in the wild

    Petr Kellnhofer, Adria Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. InProceedings of the IEEE/CVF international conference on computer vision, pages 6912–6921, 2019

  18. [18]

    Nvgaze: An anatomically-informed dataset for low-latency, near-eye gaze estimation

    Joohwan Kim, Michael Stengel, Alexander Majercik, Shalini De Mello, David Dunn, Samuli Laine, Morgan McGuire, and David Luebke. Nvgaze: An anatomically-informed dataset for low-latency, near-eye gaze estimation. InProceedings of the 2019 CHI conference on human factors in computing systems, pages 1–12, 2019

  19. [19]

    Openeds2020: Open eyes dataset.arXiv preprint arXiv:2005.03876, 2020

    Cristina Palmero, Abhishek Sharma, Karsten Behrendt, Kapil Krishnakumar, Oleg V Komogortsev, and Sachin S Talathi. Openeds2020: Open eyes dataset.arXiv preprint arXiv:2005.03876, 2020

  20. [20]

    Teyed: Over 20 million real-world eye images with pupil, eyelid, and iris 2d and 3d segmentations, 2d and 3d landmarks, 3d eyeball, gaze vector, and eye movement types

    Wolfgang Fuhl, Gjergji Kasneci, and Enkelejda Kasneci. Teyed: Over 20 million real-world eye images with pupil, eyelid, and iris 2d and 3d segmentations, 2d and 3d landmarks, 3d eyeball, gaze vector, and eye movement types. arXiv preprint arXiv:2102.02115, 2021

  21. [21]

De^2Gaze: Deformable and decoupled representation learning for 3d gaze estimation

Yunfeng Xiao, Xiaowei Bai, Baojun Chen, Hao Su, Hao He, Liang Xie, and Erwei Yin. De^2Gaze: Deformable and decoupled representation learning for 3d gaze estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3091–3100, 2025

  22. [22]

    U2eyes: A binocular dataset for eye tracking and gaze estimation

    Sonia Porta, Benoit Bossavit, Rafael Cabeza, Andoni Larumbe-Bergera, Gonzalo Garde, and Arantxa Villanueva. U2eyes: A binocular dataset for eye tracking and gaze estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019

  23. [23]

    Learning an appearance-based gaze estimator from one million synthesised images

    Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, and Andreas Bulling. Learning an appearance-based gaze estimator from one million synthesised images. InProceedings of the ninth biennial ACM symposium on eye tracking research & applications, pages 131–138, 2016

  24. [24]

    Learning from simulated and unsupervised images through adversarial training

    Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2107–2116, 2017

  25. [25]

    Deep domain adaptation: A sim2real neural approach for improving eye-tracking systems.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 7(2):1–17, 2024

    Viet Dung Nguyen, Reynold Bailey, Gabriel J Diaz, Chengyi Ma, Alexander Fix, and Alexander Ororbia. Deep domain adaptation: A sim2real neural approach for improving eye-tracking systems.Proceedings of the ACM on Computer Graphics and Interactive Techniques, 7(2):1–17, 2024

  26. [26]

    Rendering of eyes for eye-shape registration and gaze estimation

    Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. Rendering of eyes for eye-shape registration and gaze estimation. InProceedings of the IEEE International Conference on Computer Vision, pages 3756–3764, 2015

  27. [27]

    Eyenerf: a hybrid representation for photorealistic synthesis, animation and relighting of human eyes.ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

    Gengyan Li, Abhimitra Meka, Franziska Mueller, Marcel C Buehler, Otmar Hilliges, and Thabo Beeler. Eyenerf: a hybrid representation for photorealistic synthesis, animation and relighting of human eyes.ACM Transactions on Graphics (TOG), 41(4):1–16, 2022

  28. [28]

    Self-supervised domain adaptation for computer vision tasks

    Jiaolong Xu, Liang Xiao, and Antonio M López. Self-supervised domain adaptation for computer vision tasks. IEEE Access, 7:156694–156706, 2019

  29. [29]

Improving out-of-distribution generalization via multi-task self-supervised pretraining. arXiv preprint arXiv:2003.13525, 2020

Isabela Albuquerque, Nikhil Naik, Junnan Li, Nitish Keskar, and Richard Socher. Improving out-of-distribution generalization via multi-task self-supervised pretraining. arXiv preprint arXiv:2003.13525, 2020

  30. [30]

    A cookbook of self-supervised learning.arXiv preprint arXiv:2304.12210, 2023

    Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning.arXiv preprint arXiv:2304.12210, 2023

  31. [31]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

  32. [32]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  33. [33]

    Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

  34. [34]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  36. [36]

Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021

Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021

  37. [37]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  38. [38]

    BEiT: BERT Pre-Training of Image Transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers.arXiv preprint arXiv:2106.08254, 2021

  39. [39]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer.arXiv preprint arXiv:2111.07832, 2021

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  41. [41]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

  42. [42]

    Fitnets: Hints for thin deep nets, 2015

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets, 2015

  43. [43]

    Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.arXiv preprint arXiv:1612.03928, 2016

  44. [44]

    Similarity-preserving knowledge distillation

    Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. InProceedings of the IEEE/CVF international conference on computer vision, pages 1365–1374, 2019

  45. [45]

    Contrastive Representation Distillation

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive Representation Distillation. InInternational Conference on Learning Representations, 2020

  46. [46]

    Co-training and co-distillation for quality improvement and compression of language models

    Hayeon Lee, Rui Hou, Jongpil Kim, Davis Liang, Hongbo Zhang, Sung Hwang, and Alexander Min. Co-training and co-distillation for quality improvement and compression of language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 7458–7467, 2023

  47. [47]

    Vic-kd: Variance-invariance-covariance knowledge distillation to make keyword spotting more robust against adversarial attacks

    Heitor R Guimarães, Arthur Pimentel, Anderson Avila, and Tiago H Falk. Vic-kd: Variance-invariance-covariance knowledge distillation to make keyword spotting more robust against adversarial attacks. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12196–12200. IEEE, 2024

  48. [48]

Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10734–10742, 2019

  49. [49]

    Evaluation of eye tracking signal quality for virtual reality applications: A case study in the meta quest pro

    Samantha Aziz, Dillon J Lohr, Lee Friedman, and Oleg Komogortsev. Evaluation of eye tracking signal quality for virtual reality applications: A case study in the meta quest pro. InProceedings of the 2024 Symposium on Eye Tracking Research and Applications, pages 1–8, 2024

  50. [50]

    Dare-gram: Unsupervised domain adaptation regression by aligning inverse gram matrices

Ismail Nejjar, Qin Wang, and Olga Fink. Dare-gram: Unsupervised domain adaptation regression by aligning inverse gram matrices. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11744–11754, 2023