pith. machine review for the scientific record.

arxiv: 2605.13083 · v1 · submitted 2026-05-13 · 💻 cs.RO


TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Chuqiao Lyu, Feiyang Hong, Guannan Zhang, Haotian Wu, Jianyi Zhou, Ruichen Zhen, Shuo Yang, Weisheng Dai, Wenbo Ding, Xushi Wang, Yinian Mao, Yuxiang Jiang, Zirui Liu, Ziteng Gao


Pith reviewed 2026-05-14 18:58 UTC · model grok-4.3

classification 💻 cs.RO
keywords tactile estimation · egocentric video · bimanual interaction · pressure maps · multi-view vision · hand-object interaction · dataset · embodied learning

The pith

Tactile pressure maps for bimanual hand-object interactions can be predicted from egocentric video, with optional wrist-mounted views further refining the estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the absence of tactile signals in large-scale egocentric video, which prevents models from learning accurate contact, force, and pressure dynamics during human-object manipulation. It releases the EgoTouch dataset containing 208 tasks, 1,891 episodes, multi-view RGB footage, 3D hand poses, and dense pressure readings from wearable sensors in varied indoor and outdoor settings. The TouchAnything framework is introduced to predict tactile outputs primarily from the head-mounted egocentric view while flexibly using wrist-mounted views at inference time when available. Experiments demonstrate measurable gains in prediction quality when wrist views are added. This approach aims to provide scalable tactile supervision for embodied learning without requiring tactile hardware on every data collection rig.

Core claim

The paper establishes that continuous pressure maps for bimanual hand-object interactions can be inferred from egocentric video input, and that a multi-view vision-to-touch model trained on the EgoTouch dataset achieves up to 5.0 percent relative improvement in Contact IoU and 6.1 percent relative improvement in Volumetric IoU when wrist-mounted views are included alongside the primary egocentric view.
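Neither metric is defined in this summary. Under the definitions commonly used in hand-pressure estimation work, taken here only as an assumption about what the paper likely measures, Contact IoU is the intersection-over-union of thresholded contact masks and Volumetric IoU compares the pressure magnitudes themselves. A minimal sketch:

```python
# Hedged sketch of the two metrics under commonly used definitions (assumed
# here, not quoted from the paper): Contact IoU over thresholded contact
# masks, Volumetric IoU as the sum of element-wise minima over maxima.
import numpy as np

def contact_iou(pred, gt, thresh=0.0):
    """IoU of the binary contact regions obtained by thresholding pressure maps."""
    pred_c, gt_c = pred > thresh, gt > thresh
    union = np.logical_or(pred_c, gt_c).sum()
    return 1.0 if union == 0 else np.logical_and(pred_c, gt_c).sum() / union

def volumetric_iou(pred, gt):
    """Soft IoU over pressure magnitudes rather than binary contact."""
    denom = np.maximum(pred, gt).sum()
    return 1.0 if denom == 0 else np.minimum(pred, gt).sum() / denom

# Toy example with random maps; real inputs would be predicted vs. sensor maps.
pred = np.random.rand(32, 32)   # predicted pressure map, arbitrary units
gt = np.random.rand(32, 32)     # wearable-sensor ground truth
print(contact_iou(pred, gt), volumetric_iou(pred, gt))
```

The "relative improvement" framing matters when reading the headline numbers: a 5.0 percent relative gain on a Contact IoU of 0.60 is an absolute gain of only 0.03.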

What carries the argument

TouchAnything, a baseline multi-view vision-to-touch prediction framework that treats the egocentric view as the main input and incorporates available wrist views to refine tactile estimates.

If this is right

  • Large egocentric video collections can receive inferred tactile labels to support physically grounded pretraining of interaction models.
  • Embodied agents can acquire contact-aware representations without deploying tactile hardware on every robot or recording rig.
  • Bimanual manipulation policies can be trained with more realistic estimates of force and pressure distribution.
  • Dataset scaling for tactile research becomes feasible across diverse real-world environments using only cameras.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-existing egocentric video archives could be retroactively annotated with tactile predictions to bootstrap new training runs.
  • Wearable systems using only head and wrist cameras might deliver approximate touch feedback for augmented or virtual reality interfaces.
  • The prediction approach could be extended to estimate additional contact properties such as friction or slip once paired with suitable auxiliary signals.

Load-bearing premise

The wearable tactile sensors supply accurate, dense, and time-synchronized ground-truth pressure maps that can serve as reliable supervision for training vision-based prediction models.

What would settle it

Apply the trained model to a new set of manipulation episodes recorded with the same sensor suite but featuring unseen objects or force profiles, then check whether the predicted contact regions and pressure values diverge substantially from the actual sensor readings in location, area, or magnitude.

Original abstract

Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the EgoTouch dataset comprising 208 bimanual manipulation tasks and 1,891 episodes with synchronized head-mounted egocentric RGB, dual wrist-mounted RGB, 3D hand poses, and dense pressure maps from wearable tactile sensors. It also presents the TouchAnything multi-view vision-to-touch prediction framework that takes egocentric video as primary input and optionally incorporates wrist views at inference time, reporting relative gains of up to 5.0% in Contact IoU and 6.1% in Volumetric IoU when wrist views are added.

Significance. If the tactile ground-truth signals prove reliable, the dataset and benchmark would enable scalable vision-based tactile supervision for egocentric video, supporting physically grounded embodied learning without requiring tactile hardware at test time. The public release of data, code, and benchmark is a clear strength for reproducibility.

major comments (3)
  1. [Dataset] Dataset section: the manuscript provides no description of sensor calibration against known forces, drift correction, contact-area validation, or temporal synchronization checks between RGB frames and pressure readings across the 1,891 episodes. This is load-bearing because the reported IoU improvements are measured against these pressure maps as supervision.
  2. [Experiments] Experiments section: the headline relative improvements (5.0% Contact IoU, 6.1% Volumetric IoU) are presented without error bars, statistical significance tests, explicit train/validation/test splits, or comparison against strong single-view and multi-view baselines, preventing assessment of whether the wrist-view gains are robust.
  3. [Method] Model description: the TouchAnything architecture does not specify the exact mechanism for flexibly incorporating wrist views only at inference (e.g., whether the network is trained with all views or uses late fusion), which is central to the claim of flexible multi-view prediction.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to' for the maximum improvements does not indicate the specific tasks or conditions under which these values are achieved.
  2. [Evaluation Metrics] Notation: the definitions of Contact IoU and Volumetric IoU are not restated in the main text, forcing readers to consult supplementary material for metric details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.

Point-by-point responses
  1. Referee: [Dataset] Dataset section: the manuscript provides no description of sensor calibration against known forces, drift correction, contact-area validation, or temporal synchronization checks between RGB frames and pressure readings across the 1,891 episodes. This is load-bearing because the reported IoU improvements are measured against these pressure maps as supervision.

    Authors: We agree that a detailed account of sensor calibration and synchronization is necessary to substantiate the reliability of the tactile supervision. In the revised manuscript, we will add a new subsection in the Dataset section describing: (1) calibration of the wearable tactile sensors against known forces using a reference force gauge, (2) the drift correction procedure applied to pressure readings, (3) contact-area validation performed by cross-referencing pressure maps with visual contact annotations on a held-out subset of episodes, and (4) the temporal synchronization protocol employing hardware triggers and post-hoc timestamp alignment between the RGB streams and pressure data. These additions will directly address the concern regarding the quality of the ground-truth signals. revision: yes

  2. Referee: [Experiments] Experiments section: the headline relative improvements (5.0% Contact IoU, 6.1% Volumetric IoU) are presented without error bars, statistical significance tests, explicit train/validation/test splits, or comparison against strong single-view and multi-view baselines, preventing assessment of whether the wrist-view gains are robust.

    Authors: We concur that additional statistical rigor and baseline comparisons are required to properly evaluate the robustness of the wrist-view gains. In the revision we will: report error bars as standard deviations computed over five independent training runs with different random seeds; include paired t-test results with p-values to establish statistical significance of the improvements; explicitly document the train/validation/test episode splits (70/15/15 ratio, stratified by task to prevent leakage); and expand the baselines to include a strong single-view egocentric model, a fully multi-view model trained and tested with all views, and an additional late-fusion variant. These changes will allow readers to assess the reliability of the reported relative gains. revision: yes

  3. Referee: [Method] Model description: the TouchAnything architecture does not specify the exact mechanism for flexibly incorporating wrist views only at inference (e.g., whether the network is trained with all views or uses late fusion), which is central to the claim of flexible multi-view prediction.

    Authors: We thank the referee for highlighting this ambiguity. The TouchAnything model is trained end-to-end with all available views (egocentric plus wrist) using a cross-attention fusion module; at inference, wrist views are optionally omitted by zero-masking the corresponding input tokens to the fusion layer, which enables flexible single- or multi-view operation without retraining. We will revise the Method section to explicitly describe this training/inference procedure, add pseudocode for the masking mechanism, and include a diagram clarifying the view-conditional forward pass. revision: yes
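Two of the promised revisions are concrete enough to sketch. For response 2, the snippet below illustrates per-seed error bars and a paired t-test on per-episode scores using scipy; the array shapes, episode count, and five-seed setup are assumptions for illustration, not the authors' evaluation code or results.

```python
# Illustration of the robustness analysis promised in response 2 (assumed
# shapes and synthetic numbers, not the authors' results): per-episode
# Contact IoU for two variants, per-seed error bars, and a paired t-test.
import numpy as np
from scipy import stats

# iou[seed, episode]: metric per held-out episode, one row per random seed.
rng = np.random.default_rng(0)
iou_ego_only = rng.uniform(0.4, 0.7, size=(5, 200))                 # hypothetical
iou_with_wrist = iou_ego_only + rng.uniform(0.0, 0.05, size=(5, 200))  # hypothetical gain

per_seed_ego = iou_ego_only.mean(axis=1)
per_seed_wrist = iou_with_wrist.mean(axis=1)
print(f"ego-only:   {per_seed_ego.mean():.3f} +/- {per_seed_ego.std():.3f}")
print(f"with wrist: {per_seed_wrist.mean():.3f} +/- {per_seed_wrist.std():.3f}")

# Paired t-test over episodes, averaging seeds first to get one score each.
t, p = stats.ttest_rel(iou_with_wrist.mean(axis=0), iou_ego_only.mean(axis=0))
print(f"paired t-test: t={t:.2f}, p={p:.3g}")
```

For response 3, a minimal PyTorch sketch of view-conditional cross-attention fusion with token masking follows. The module layout, the learned null token, and the masking convention are assumptions made to keep the example well-defined; the released TouchAnything code may differ.

```python
# Minimal sketch of view-conditional fusion via cross-attention with optional
# wrist views, illustrating the mechanism described in response 3 (not the
# released TouchAnything implementation). Egocentric tokens attend to wrist
# tokens; missing wrist views are zeroed and excluded from attention.
import torch
import torch.nn as nn

class ViewConditionalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learned "no wrist view" token keeps attention well-defined even when
        # every wrist token is masked out at inference.
        self.null_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, ego_tokens, wrist_tokens, wrist_available):
        # ego_tokens: (B, N_ego, D); wrist_tokens: (B, N_wrist, D)
        # wrist_available: (B, N_wrist) bool, False where a wrist view is missing.
        B = ego_tokens.shape[0]
        null = self.null_token.expand(B, 1, -1)
        keys = torch.cat([null, wrist_tokens * wrist_available.unsqueeze(-1)], dim=1)
        keep = torch.cat(
            [torch.ones(B, 1, dtype=torch.bool, device=ego_tokens.device),
             wrist_available], dim=1)
        fused, _ = self.attn(ego_tokens, keys, keys, key_padding_mask=~keep)
        return self.norm(ego_tokens + fused)

# Usage: drop wrist views at inference by marking them unavailable.
fusion = ViewConditionalFusion()
ego = torch.randn(2, 196, 256)
wrist = torch.randn(2, 392, 256)          # e.g. two wrist cameras, 196 tokens each
have_wrist = torch.ones(2, 392, dtype=torch.bool)
out_multi = fusion(ego, wrist, have_wrist)                    # egocentric + wrist
out_ego = fusion(ego, wrist, torch.zeros_like(have_wrist))    # egocentric only
print(out_multi.shape, out_ego.shape)     # torch.Size([2, 196, 256]) twice
```

Zero-masking alone would leave the attention softmax over an empty key set undefined when no wrist view is present, which is why the sketch adds an always-available null token; whether the authors handle this edge case the same way is not stated in the rebuttal.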

Circularity Check

0 steps flagged

No circularity; purely empirical dataset and baseline

Full rationale

The paper presents a new multi-view egocentric dataset (EgoTouch) with synchronized RGB, 3D hand pose, and wearable pressure maps across 1,891 episodes, plus a baseline multi-view vision-to-touch prediction model (TouchAnything). All reported results are empirical performance numbers (Contact IoU, Volumetric IoU) measured on held-out episodes from the collected data; no equations, fitted parameters, or first-principles derivations are claimed, and no self-citation chain is used to justify uniqueness or force a result. The contribution is therefore self-contained against external benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the new data collection and the assumption that visual features suffice to predict tactile signals; no invented physical entities are introduced.

free parameters (1)
  • neural network hyperparameters
    Standard training parameters such as learning rate and architecture choices are optimized on the collected data.
axioms (1)
  • domain assumption: Wearable tactile sensors provide accurate and dense ground-truth pressure maps synchronized with video.
    All supervision and evaluation rest on the fidelity of these sensor readings across the 1,891 episodes.

pith-pipeline@v0.9.0 · 5640 in / 1371 out tokens · 61143 ms · 2026-05-14T18:58:02.368438+00:00 · methodology


