pith. machine review for the scientific record.

arxiv: 2605.13083 · v1 · submitted 2026-05-13 · 💻 cs.RO


TouchAnything: A Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Chuqiao Lyu, Feiyang Hong, Guannan Zhang, Haotian Wu, Jianyi Zhou, Ruichen Zhen, Shuo Yang, Weisheng Dai, Wenbo Ding, Xushi Wang, Yinian Mao, Yuxiang Jiang, Zirui Liu, Ziteng Gao


Pith reviewed 2026-05-14 18:58 UTC · model grok-4.3

classification 💻 cs.RO
keywords tactile estimation · egocentric video · bimanual interaction · pressure maps · multi-view vision · hand-object interaction · dataset · embodied learning

The pith

Tactile pressure maps for bimanual hand-object interactions can be predicted from egocentric video, with optional wrist-mounted views further refining the estimates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the absence of tactile signals in large-scale egocentric video, which prevents models from learning accurate contact, force, and pressure dynamics during human-object manipulation. It releases the EgoTouch dataset containing 208 tasks, 1,891 episodes, multi-view RGB footage, 3D hand poses, and dense pressure readings from wearable sensors in varied indoor and outdoor settings. The TouchAnything framework is introduced to predict tactile outputs primarily from the head-mounted egocentric view while flexibly using wrist-mounted views at inference time when available. Experiments demonstrate measurable gains in prediction quality when wrist views are added. This approach aims to provide scalable tactile supervision for embodied learning without requiring tactile hardware on every data collection rig.

Core claim

The paper establishes that continuous pressure maps for bimanual hand-object interactions can be inferred from egocentric video input, and that a multi-view vision-to-touch model trained on the EgoTouch dataset achieves up to 5.0 percent relative improvement in Contact IoU and 6.1 percent relative improvement in Volumetric IoU when wrist-mounted views are included alongside the primary egocentric view.
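Neither metric is defined in this summary. Under the definitions commonly used in hand-pressure estimation work, taken here only as an assumption about what the paper likely measures, Contact IoU is the intersection-over-union of thresholded contact masks and Volumetric IoU compares the pressure magnitudes themselves. A minimal sketch:

```python
# Hedged sketch of the two metrics under commonly used definitions (assumed
# here, not quoted from the paper): Contact IoU over thresholded contact
# masks, Volumetric IoU as the sum of element-wise minima over maxima.
import numpy as np

def contact_iou(pred, gt, thresh=0.0):
    """IoU of the binary contact regions obtained by thresholding pressure maps."""
    pred_c, gt_c = pred > thresh, gt > thresh
    union = np.logical_or(pred_c, gt_c).sum()
    return 1.0 if union == 0 else np.logical_and(pred_c, gt_c).sum() / union

def volumetric_iou(pred, gt):
    """Soft IoU over pressure magnitudes rather than binary contact."""
    denom = np.maximum(pred, gt).sum()
    return 1.0 if denom == 0 else np.minimum(pred, gt).sum() / denom

# Toy example with random maps; real inputs would be predicted vs. sensor maps.
pred = np.random.rand(32, 32)   # predicted pressure map, arbitrary units
gt = np.random.rand(32, 32)     # wearable-sensor ground truth
print(contact_iou(pred, gt), volumetric_iou(pred, gt))
```

The "relative improvement" framing matters when reading the headline numbers: a 5.0 percent relative gain on a Contact IoU of 0.60 is an absolute gain of only 0.03.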

What carries the argument

TouchAnything, a baseline multi-view vision-to-touch prediction framework that treats the egocentric view as the main input and incorporates available wrist views to refine tactile estimates.

If this is right

  • Large egocentric video collections can receive inferred tactile labels to support physically grounded pretraining of interaction models.
  • Embodied agents can acquire contact-aware representations without deploying tactile hardware on every robot or recording rig.
  • Bimanual manipulation policies can be trained with more realistic estimates of force and pressure distribution.
  • Dataset scaling for tactile research becomes feasible across diverse real-world environments using only cameras.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-existing egocentric video archives could be retroactively annotated with tactile predictions to bootstrap new training runs.
  • Wearable systems using only head and wrist cameras might deliver approximate touch feedback for augmented or virtual reality interfaces.
  • The prediction approach could be extended to estimate additional contact properties such as friction or slip once paired with suitable auxiliary signals.

Load-bearing premise

The wearable tactile sensors supply accurate, dense, and time-synchronized ground-truth pressure maps that can serve as reliable supervision for training vision-based prediction models.

What would settle it

Apply the trained model to a new set of manipulation episodes recorded with the same sensor suite but featuring unseen objects or force profiles, then check whether the predicted contact regions and pressure values diverge substantially from the actual sensor readings in location, area, or magnitude.

Original abstract

Egocentric human video data, which captures rich human-environment interactions and can be collected at scale, has become a key driver of embodied intelligence research. However, existing egocentric datasets typically lack tactile sensing, a critical modality that provides direct cues about contact, force, and pressure in human-object interaction. Without such signals, models struggle to learn physically grounded representations of real-world interaction dynamics. While tactile sensors provide these cues, deploying high-quality tactile hardware at scale remains expensive and cumbersome. This raises a central question: can tactile feedback be inferred directly from visual observations, enabling scalable tactile supervision for egocentric video data and supporting physically grounded embodied learning? To enable research in this direction, we introduce EgoTouch, a large-scale multi-view egocentric dataset with dense tactile supervision for bimanual hand-object interaction. EgoTouch comprises 208 manipulation tasks spanning 1,891 episodes in diverse indoor and outdoor environments, with synchronized multi-view RGB (head-mounted egocentric and dual wrist-mounted cameras), bimanual 3D hand pose, and continuous pressure maps from wearable tactile sensors. Building on EgoTouch, we introduce TouchAnything, a baseline multi-view vision-to-touch prediction framework that uses the egocentric view as the primary input and flexibly leverages available wrist-mounted views at inference time. Experiments show that incorporating wrist-mounted views generally improves tactile prediction over egocentric-only input, achieving up to 5.0% relative improvement in Contact IoU and 6.1% relative improvement in Volumetric IoU. We will publicly release the dataset, code, and benchmark.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the EgoTouch dataset comprising 208 bimanual manipulation tasks and 1,891 episodes with synchronized head-mounted egocentric RGB, dual wrist-mounted RGB, 3D hand poses, and dense pressure maps from wearable tactile sensors. It also presents the TouchAnything multi-view vision-to-touch prediction framework that takes egocentric video as primary input and optionally incorporates wrist views at inference time, reporting relative gains of up to 5.0% in Contact IoU and 6.1% in Volumetric IoU when wrist views are added.

Significance. If the tactile ground-truth signals prove reliable, the dataset and benchmark would enable scalable vision-based tactile supervision for egocentric video, supporting physically grounded embodied learning without requiring tactile hardware at test time. The public release of data, code, and benchmark is a clear strength for reproducibility.

major comments (3)
  1. [Dataset] Dataset section: the manuscript provides no description of sensor calibration against known forces, drift correction, contact-area validation, or temporal synchronization checks between RGB frames and pressure readings across the 1,891 episodes. This is load-bearing because the reported IoU improvements are measured against these pressure maps as supervision.
  2. [Experiments] Experiments section: the headline relative improvements (5.0% Contact IoU, 6.1% Volumetric IoU) are presented without error bars, statistical significance tests, explicit train/validation/test splits, or comparison against strong single-view and multi-view baselines, preventing assessment of whether the wrist-view gains are robust.
  3. [Method] Model description: the TouchAnything architecture does not specify the exact mechanism for flexibly incorporating wrist views only at inference (e.g., whether the network is trained with all views or uses late fusion), which is central to the claim of flexible multi-view prediction.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to' for the maximum improvements does not indicate the specific tasks or conditions under which these values are achieved.
  2. [Evaluation Metrics] Notation: the definitions of Contact IoU and Volumetric IoU are not restated in the main text, forcing readers to consult supplementary material for metric details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements for greater clarity and rigor.

Point-by-point responses
  1. Referee: [Dataset] Dataset section: the manuscript provides no description of sensor calibration against known forces, drift correction, contact-area validation, or temporal synchronization checks between RGB frames and pressure readings across the 1,891 episodes. This is load-bearing because the reported IoU improvements are measured against these pressure maps as supervision.

    Authors: We agree that a detailed account of sensor calibration and synchronization is necessary to substantiate the reliability of the tactile supervision. In the revised manuscript, we will add a new subsection in the Dataset section describing: (1) calibration of the wearable tactile sensors against known forces using a reference force gauge, (2) the drift correction procedure applied to pressure readings, (3) contact-area validation performed by cross-referencing pressure maps with visual contact annotations on a held-out subset of episodes, and (4) the temporal synchronization protocol employing hardware triggers and post-hoc timestamp alignment between the RGB streams and pressure data. These additions will directly address the concern regarding the quality of the ground-truth signals. revision: yes

  2. Referee: [Experiments] Experiments section: the headline relative improvements (5.0% Contact IoU, 6.1% Volumetric IoU) are presented without error bars, statistical significance tests, explicit train/validation/test splits, or comparison against strong single-view and multi-view baselines, preventing assessment of whether the wrist-view gains are robust.

    Authors: We concur that additional statistical rigor and baseline comparisons are required to properly evaluate the robustness of the wrist-view gains. In the revision we will: report error bars as standard deviations computed over five independent training runs with different random seeds; include paired t-test results with p-values to establish statistical significance of the improvements; explicitly document the train/validation/test episode splits (70/15/15 ratio, stratified by task to prevent leakage); and expand the baselines to include a strong single-view egocentric model, a fully multi-view model trained and tested with all views, and an additional late-fusion variant. These changes will allow readers to assess the reliability of the reported relative gains. revision: yes

  3. Referee: [Method] Model description: the TouchAnything architecture does not specify the exact mechanism for flexibly incorporating wrist views only at inference (e.g., whether the network is trained with all views or uses late fusion), which is central to the claim of flexible multi-view prediction.

    Authors: We thank the referee for highlighting this ambiguity. The TouchAnything model is trained end-to-end with all available views (egocentric plus wrist) using a cross-attention fusion module; at inference, wrist views are optionally omitted by zero-masking the corresponding input tokens to the fusion layer, which enables flexible single- or multi-view operation without retraining. We will revise the Method section to explicitly describe this training/inference procedure, add pseudocode for the masking mechanism, and include a diagram clarifying the view-conditional forward pass. revision: yes
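Two of the promised revisions are concrete enough to sketch. For response 2, the snippet below illustrates per-seed error bars and a paired t-test on per-episode scores using scipy; the array shapes, episode count, and five-seed setup are assumptions for illustration, not the authors' evaluation code or results.

```python
# Illustration of the robustness analysis promised in response 2 (assumed
# shapes and synthetic numbers, not the authors' results): per-episode
# Contact IoU for two variants, per-seed error bars, and a paired t-test.
import numpy as np
from scipy import stats

# iou[seed, episode]: metric per held-out episode, one row per random seed.
rng = np.random.default_rng(0)
iou_ego_only = rng.uniform(0.4, 0.7, size=(5, 200))                 # hypothetical
iou_with_wrist = iou_ego_only + rng.uniform(0.0, 0.05, size=(5, 200))  # hypothetical gain

per_seed_ego = iou_ego_only.mean(axis=1)
per_seed_wrist = iou_with_wrist.mean(axis=1)
print(f"ego-only:   {per_seed_ego.mean():.3f} +/- {per_seed_ego.std():.3f}")
print(f"with wrist: {per_seed_wrist.mean():.3f} +/- {per_seed_wrist.std():.3f}")

# Paired t-test over episodes, averaging seeds first to get one score each.
t, p = stats.ttest_rel(iou_with_wrist.mean(axis=0), iou_ego_only.mean(axis=0))
print(f"paired t-test: t={t:.2f}, p={p:.3g}")
```

For response 3, a minimal PyTorch sketch of view-conditional cross-attention fusion with token masking follows. The module layout, the learned null token, and the masking convention are assumptions made to keep the example well-defined; the released TouchAnything code may differ.

```python
# Minimal sketch of view-conditional fusion via cross-attention with optional
# wrist views, illustrating the mechanism described in response 3 (not the
# released TouchAnything implementation). Egocentric tokens attend to wrist
# tokens; missing wrist views are zeroed and excluded from attention.
import torch
import torch.nn as nn

class ViewConditionalFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learned "no wrist view" token keeps attention well-defined even when
        # every wrist token is masked out at inference.
        self.null_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, ego_tokens, wrist_tokens, wrist_available):
        # ego_tokens: (B, N_ego, D); wrist_tokens: (B, N_wrist, D)
        # wrist_available: (B, N_wrist) bool, False where a wrist view is missing.
        B = ego_tokens.shape[0]
        null = self.null_token.expand(B, 1, -1)
        keys = torch.cat([null, wrist_tokens * wrist_available.unsqueeze(-1)], dim=1)
        keep = torch.cat(
            [torch.ones(B, 1, dtype=torch.bool, device=ego_tokens.device),
             wrist_available], dim=1)
        fused, _ = self.attn(ego_tokens, keys, keys, key_padding_mask=~keep)
        return self.norm(ego_tokens + fused)

# Usage: drop wrist views at inference by marking them unavailable.
fusion = ViewConditionalFusion()
ego = torch.randn(2, 196, 256)
wrist = torch.randn(2, 392, 256)          # e.g. two wrist cameras, 196 tokens each
have_wrist = torch.ones(2, 392, dtype=torch.bool)
out_multi = fusion(ego, wrist, have_wrist)                    # egocentric + wrist
out_ego = fusion(ego, wrist, torch.zeros_like(have_wrist))    # egocentric only
print(out_multi.shape, out_ego.shape)     # torch.Size([2, 196, 256]) twice
```

Zero-masking alone would leave the attention softmax over an empty key set undefined when no wrist view is present, which is why the sketch adds an always-available null token; whether the authors handle this edge case the same way is not stated in the rebuttal.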

Circularity Check

0 steps flagged

No circularity; purely empirical dataset and baseline

Full rationale

The paper presents a new multi-view egocentric dataset (EgoTouch) with synchronized RGB, 3D hand pose, and wearable pressure maps across 1,891 episodes, plus a baseline multi-view vision-to-touch prediction model (TouchAnything). All reported results are empirical performance numbers (Contact IoU, Volumetric IoU) measured on held-out episodes from the collected data; no equations, fitted parameters, or first-principles derivations are claimed, and no self-citation chain is used to justify uniqueness or force a result. The contribution is therefore self-contained against external benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the new data collection and the assumption that visual features suffice to predict tactile signals; no invented physical entities are introduced.

free parameters (1)
  • neural network hyperparameters
    Standard training parameters such as learning rate and architecture choices are optimized on the collected data.
axioms (1)
  • domain assumption: Wearable tactile sensors provide accurate and dense ground-truth pressure maps synchronized with video.
    All supervision and evaluation rest on the fidelity of these sensor readings across the 1,891 episodes.

pith-pipeline@v0.9.0 · 5640 in / 1371 out tokens · 61143 ms · 2026-05-14T18:58:02.368438+00:00 · methodology


