Multi-Resolution Tactile Imitation Learning for Contact-Rich Robotic Manipulation

Erik Helmut; Georgia Chalvatzaki; Jan Peters; Niklas Funk; Rickmer Krohn; Vignesh Prasad

arxiv: 2606.06281 · v1 · pith:2SPXUXZSnew · submitted 2026-06-04 · 💻 cs.RO

Rickmer Krohn, Erik Helmut, Niklas Funk, Jan Peters, Vignesh Prasad, Georgia Chalvatzaki

Pith reviewed 2026-06-28 00:55 UTC · model grok-4.3

keywords tactile sensing, imitation learning, sensor fusion, robotic manipulation, contact-rich tasks, multi-resolution, GelSight, event-based sensors

The pith

Multi-resolution tactile sensing fuses heterogeneous sensors to reach 80% success in contact-rich robotic manipulation where vision alone reaches 31%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiTaS to combine an RGB camera with two tactile sensors that operate at different temporal resolutions for imitation learning of contact-rich tasks. It demonstrates through five tasks that the fused representation conditions a flow-matching policy more effectively than vision-only or single-tactile baselines. The work also shows that co-training on multi-tactile data improves results even when the highest-frequency sensor is unavailable at evaluation time. Sensor attention analysis confirms that the different resolutions contribute at distinct phases of each task.

Core claim

MiTaS uses modality-specific convolutional stems and transformer-based fusion to integrate RGB camera data with a vision-based GelSight Mini sensor and a high-frequency event-based Evetac sensor, then conditions a flow-matching policy on this representation. Across five contact-rich manipulation tasks this yields an average 80% success rate, compared with 31% for vision-only and 54% for visual-tactile baselines. Co-training a visuo-tactile model with the additional multi-tactile data improves performance by more than 10% on some tasks without requiring the Evetac sensor during policy execution.

What carries the argument

Modality-specific convolutional stems plus transformer fusion that combines RGB, GelSight Mini, and Evetac streams at different temporal resolutions to condition a flow-matching policy.
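
The abstract does not spell out the flow-matching formulation. As a minimal sketch, assuming the common straight-line probability path from the flow-matching literature, an action can be sampled by integrating a velocity field from noise (t = 0) to an action (t = 1). Everything here is illustrative: `oracle_velocity`, `sample_action`, and the 3-D `demo_action` are hypothetical names, and the closed-form oracle stands in for the learned network that MiTaS would condition on the fused sensor representation.

```python
import random


def oracle_velocity(x, t, target):
    """Stand-in (assumption) for a learned velocity network v(x, t | obs).

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity that
    carries the current point x to x1 by time 1 is (x1 - x) / (1 - t).
    A trained policy would predict this from the fused sensor embedding.
    """
    return [(x1_i - x_i) / (1.0 - t) for x_i, x1_i in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Euler-integrate the flow ODE from Gaussian noise to an action."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # x0 ~ N(0, I)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = oracle_velocity(x, t, target)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


# Hypothetical 3-D end-effector displacement used as the demonstrated action.
demo_action = [0.2, -0.5, 0.1]
sampled = sample_action(demo_action)
```

Because the oracle always points straight at the target, the Euler loop lands on `demo_action` for any seed. A learned velocity field would only approximate this, and its quality would depend on how informative the conditioning representation is.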

If this is right

MiTaS achieves 80% average success across the five contact-rich tasks.
Vision-only baselines reach only 31% and visual-tactile baselines reach 54%.
Co-training with multi-tactile data raises performance by over 10% in certain tasks even when the high-frequency sensor is absent at test time.
Attention maps show that each sensor contributes at different stages of task execution.
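
To make the attention-map claim concrete, here is a rough sketch of how a per-sensor attention share could be computed from a single action query attending over sensor tokens. It assumes plain scaled dot-product attention; the 2-D keys, the token layout, and the function names are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_per_sensor(query, tokens):
    """Attention of one query over (sensor_name, key) tokens, summed per sensor."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for _, key in tokens]
    weights = softmax(scores)
    per_sensor = {}
    for (sensor, _), w in zip(tokens, weights):
        per_sensor[sensor] = per_sensor.get(sensor, 0.0) + w
    return per_sensor


# Hypothetical keys; in a real model these would come from the CNN stems.
tokens = [
    ("rgb", [1.0, 0.0]),
    ("rgb", [0.5, 0.5]),
    ("gelsight", [0.0, 1.0]),
    ("evetac", [0.0, 2.0]),
]
action_query = [0.0, 1.0]  # hypothetical action-token query
shares = attention_per_sensor(action_query, tokens)
```

Tracking `shares` over the timesteps of a rollout would give one curve per sensor, which is the kind of signal an attention analysis can plot against task phases.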

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

High-frequency event sensors may be needed only for data collection, not for final deployment, if co-training is used.
The same fusion pattern could be tested on additional robot platforms or task families that involve sustained contact.
If one sensor modality consistently dominates attention maps, future designs could drop the less-used modality after initial training.

Load-bearing premise

The heterogeneous tactile sensors supply complementary information that is not redundant with vision and that the convolutional-stem-plus-transformer architecture can extract and fuse without task-specific tuning.

What would settle it

If the same five tasks are repeated with a policy that receives only the GelSight Mini and RGB data (no Evetac even during training) and the success rate remains at or above 80%, the claim that multi-resolution fusion is required would be falsified.

Figures

Figures reproduced from arXiv: 2606.06281 by Erik Helmut, Georgia Chalvatzaki, Jan Peters, Niklas Funk, Rickmer Krohn, Vignesh Prasad.

**Figure 2.** Overview of the MiTaS architecture: Modality-specific CNN stems encode Vision, Gel…

**Figure 3.** Representative images of the five contact-rich manipulation tasks in order from left to…

**Figure 4.** MiTaS outperforms the multimodal baseline Sparsh-X and two vision-only ViT-baselines…

**Figure 5.** Attention analysis with sensor readings in Lamp Installation.

**Figure 6.** Task progress across five manipulation tasks. Each row shows representative frames…

**Figure 7.** Examples of failure cases across the five manipulation tasks. Each row shows representa…

**Figure 8.** In-hand GelSight distributions after reset across Gear Assembly (a-b), Board Wiping (c…

**Figure 9.** For each task we plot the cross-attention from the executed action token to each sensor…

**Figure 10.** We define the Evetac activation as the L1-distance to the base image in pixel-space.

Original abstract

Touch sensing is beneficial for solving a wide variety of manipulation tasks. While there exists a wide range of tactile sensors with different properties, exploiting the fusion of multiple heterogeneous tactile sensors to improve manipulation learning remains underexplored. We present Multi-Resolution Tactile Sensing (MiTaS), a representation framework that leverages multiple tactile sensors operating at different temporal resolutions in order to solve complex contact-rich manipulation tasks. We propose a novel architecture using modality-specific convolutional stems and transformer-based fusion that effectively fuses information from an RGB camera stream, a vision-based GelSight Mini sensor and a high-frequency event-based Evetac sensor. This multi-sensor representation then conditions a flow-matching policy for solving downstream tasks. Experimental results across five contact-rich manipulation tasks demonstrate the effectiveness of multi-resolution tactile features in imitation learning. MiTaS achieves an average success rate of 80%, while vision-only (31%) and visual-tactile (54%) baselines cannot solve the task reliably. Co-training a visuo-tactile model with multi-tactile data boosts performance by over 10% in certain tasks, without having access to the Evetac sensor during policy evaluation. A detailed sensor-reading and attention analysis reveals the importance of different sensors throughout task execution, validating our multi-resolution tactile sensing approach. Project Page: http://mitas-touch.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MiTaS shows a clear empirical lift from adding GelSight Mini and Evetac to vision on five contact-rich tasks, but the abstract leaves the source of that lift under-specified.

The main thing to know is that this paper reports a solid jump in success rates on contact-rich manipulation by fusing two tactile sensors at different temporal resolutions with an RGB camera. MiTaS reaches 80% average success while vision-only sits at 31% and a visual-tactile baseline at 54%, and they also show a co-training trick that improves performance even when the high-frequency sensor is dropped at test time.

What is actually new is the concrete architecture: separate convolutional stems for each modality feeding a transformer fusion layer that then conditions a flow-matching policy. The attention analysis they include is a useful addition for seeing when each sensor matters during a task. Judging from the abstract alone, that combination of sensors and the named framework does not look like a direct rehash of prior work.

The paper does a decent job describing the setup and running real-robot experiments across five tasks. The project page link is helpful for anyone who wants to look closer.

The soft spots are mostly around verification. The abstract gives overall success rates but no trial counts, variance numbers, or statistical tests, so it is hard to judge how reliable the gap really is. The stress-test concern about whether the two tactile streams actually supply non-redundant information and whether the stem-plus-transformer fusion works without per-task retuning is not answered in the provided text. If the full paper has per-modality ablations and confirms the same configuration was used unchanged across tasks, that would strengthen the central claim considerably. Without them the performance difference could partly reflect extra capacity or tuning rather than the multi-resolution idea itself.

This is for people working on tactile sensing and imitation learning in robotics. A reader who needs practical fusion methods for heterogeneous sensors would get value from the architecture and the reported numbers. It deserves a serious referee because the claim is concrete and testable even if the current write-up needs more experimental detail. I would send it to peer review.

Referee Report

3 major / 1 minor

Summary. The paper introduces MiTaS, a representation framework for fusing RGB camera data with two heterogeneous tactile sensors (the vision-based GelSight Mini and the high-frequency, event-based Evetac) via modality-specific convolutional stems and transformer-based fusion. The fused features condition a flow-matching policy for imitation learning on contact-rich manipulation. It reports an average success rate of 80% across five tasks, outperforming vision-only (31%) and visual-tactile (54%) baselines, with additional results on co-training with multi-tactile data and attention-based sensor analysis.

Significance. If the performance claims hold under rigorous controls, the work would demonstrate the value of multi-resolution tactile sensing for improving imitation learning in contact-rich robotics tasks where vision is insufficient. The attention analysis offers interpretability, and the flow-matching policy is a contemporary choice. The co-training result (boost without Evetac at test time) is a practical strength if replicated.

major comments (3)

[Results] Results section: No per-modality ablation studies or quantitative information-overlap metrics (e.g., mutual information between modalities) are provided to confirm that the GelSight Mini and Evetac streams supply non-redundant signals beyond the RGB stream and each other; without this, the 26-point gap over the visual-tactile baseline cannot be confidently attributed to multi-resolution tactile features.
[Results] Results section: The manuscript does not include a table or statement confirming that the identical convolutional-stem-plus-transformer architecture and hyperparameters were applied unchanged across all five tasks; this detail is load-bearing for the claim that the fusion generalizes without task-specific tuning.
[Abstract] Abstract and Results: The headline success rates (80%, 31%, 54%) are stated without any reference to number of trials, variance, statistical tests, or data-exclusion rules, so the central empirical claim cannot be verified from the supplied information.
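
The third comment can be illustrated numerically. The sketch below computes a Wilson score interval for a success rate under assumed trial counts; the counts of 10 and 100 are hypothetical, since the abstract reports none, and treating the five-task average as a single binomial proportion is a simplification made only for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half


# Hypothetical trial counts: the abstract does not state how many rollouts were run.
for n in (10, 100):
    mitas_low, mitas_high = wilson_interval(round(0.80 * n), n)
    base_low, base_high = wilson_interval(round(0.54 * n), n)
    print(
        f"n={n}: MiTaS [{mitas_low:.2f}, {mitas_high:.2f}], "
        f"visual-tactile [{base_low:.2f}, {base_high:.2f}]"
    )
```

Under these assumptions the two intervals overlap at 10 trials and separate at 100, which is why the reported rates are hard to interpret without the trial counts.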

minor comments (1)

[Abstract] Abstract: The baseline label 'visual-tactile (54 %)' should explicitly state which tactile sensor(s) it includes to avoid ambiguity with the proposed multi-tactile setting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the empirical claims.

Referee: [Results] Results section: No per-modality ablation studies or quantitative information-overlap metrics (e.g., mutual information between modalities) are provided to confirm that the GelSight Mini and Evetac streams supply non-redundant signals beyond the RGB stream and each other; without this, the 26-point gap over the visual-tactile baseline cannot be confidently attributed to multi-resolution tactile features.

Authors: We agree that dedicated per-modality ablations and mutual-information metrics would provide clearer evidence of complementary signals. The existing attention analysis offers some interpretability, but does not substitute for these quantitative controls. We will add the requested ablation studies and information-overlap metrics in the revised Results section.

revision: yes

Referee: [Results] Results section: The manuscript does not include a table or statement confirming that the identical convolutional-stem-plus-transformer architecture and hyperparameters were applied unchanged across all five tasks; this detail is load-bearing for the claim that the fusion generalizes without task-specific tuning.

Authors: The architecture and hyperparameters were held fixed across tasks. We will insert an explicit statement and a summary table in the revised Results section documenting this consistency.

revision: yes

Referee: [Abstract] Abstract and Results: The headline success rates (80%, 31%, 54%) are stated without any reference to number of trials, variance, statistical tests, or data-exclusion rules, so the central empirical claim cannot be verified from the supplied information.

Authors: We will revise the abstract and Results section to report the number of trials per task, standard deviations, and data-exclusion criteria. Statistical comparisons will be added where appropriate.

revision: yes
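
The information-overlap metric raised in the first exchange could, for discretized signals, be estimated along the following lines. This is a minimal sketch under strong assumptions: the sensor streams are reduced to hypothetical binary per-timestep contact flags, and the plug-in estimator shown is biased for small samples. None of the names or data come from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equally long, non-empty sequences")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical binarized contact flags, one value per timestep.
gelsight = [0, 0, 1, 1, 0, 0, 1, 1]
evetac_same = [0, 0, 1, 1, 0, 0, 1, 1]   # fully redundant with gelsight
evetac_indep = [0, 1, 0, 1, 0, 1, 0, 1]  # statistically independent of gelsight

redundant_bits = mutual_information(gelsight, evetac_same)
independent_bits = mutual_information(gelsight, evetac_indep)
```

A value near the entropy of one stream signals redundancy, while a value near zero signals that the second stream carries information the first does not. Real tactile images and event streams would first need a feature extraction or discretization step, which is itself a design choice.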

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of proposed architecture

The paper introduces the MiTaS framework and a conv-stem+transformer architecture for fusing RGB, GelSight Mini, and Evetac data to condition a flow-matching policy, then reports success rates from imitation learning rollouts on five tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All central claims rest on direct experimental comparisons (80% vs. 31%/54% baselines) rather than any reduction to inputs by construction, satisfying the default expectation of a non-circular empirical robotics paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or detailed methods, so no free parameters, axioms, or invented entities can be identified.

[1] [1]

Johansson and J

R. Johansson and J. Flanagan. Tactile sensory control of object manipulation in humans.The Senses: A Comprehensive Reference, 6:67–86, 01 2010. doi:10.1016/B978-012370880-9. 00346-7

work page doi:10.1016/b978-012370880-9 2010

[2] [2]

W. Yuan, S. Dong, and E. H. Adelson. GelSight: High-Resolution Robot Tactile Sensors for Estimating Geometry and Force.Sensors, 17(12):2762, Dec. 2017. ISSN 1424-8220. doi: 10.3390/s17122762. URLhttps://www.mdpi.com/1424-8220/17/12/2762

work page doi:10.3390/s17122762 2017

[3] [3]

Lambeta, P.-W

M. Lambeta, P.-W. Chou, S. Tian, B. Yang, B. Maloon, V . R. Most, D. Stroud, R. Santos, A. Byagowi, G. Kammerer, et al. Digit: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation.IEEE Robotics and Automation Letters, 5(3):3838–3845, 2020

2020

[4] [4]

Ward-Cherrier, N

B. Ward-Cherrier, N. Pestell, L. Cramphorn, B. Winstone, M. E. Giannaccini, J. Rossiter, and N. F. Lepora. The tactip family: Soft optical tactile sensors with 3d-printed biomimetic morphologies.Soft robotics, 5(2):216–227, 2018

2018

[5] [5]

S. Dong, D. K. Jha, D. Romeres, et al. Tactile-rl for insertion: Generalization to objects of unknown geometry. InICRA, 2021

2021

[6] [6]

Calandra, A

R. Calandra, A. Owens, M. Upadhyaya, W. Yuan, J. Lin, E. H. Adelson, and S. Levine. The feeling of success: Does touch sensing help predict grasp outcomes?, 2025. URLhttps: //arxiv.org/abs/1710.05512

arXiv 2025

[7] [7]

Helmut, N

E. Helmut, N. Funk, T. Schneider, C. d. Farias, and J. Peters. Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation, Oct. 2025. URLhttp://arxiv.org/abs/ 2510.13324. arXiv:2510.13324 [cs]

arXiv 2025

[8] [8]

In: 2022 International Conference on Robotics and Automation (ICRA), pp

J. Hansen, F. Hogan, D. Rivkin, D. Meger, M. Jenkin, and G. Dudek. Visuotactile-RL: Learn- ing Multimodal Manipulation Policies with Deep Reinforcement Learning. In2022 Interna- tional Conference on Robotics and Automation (ICRA), pages 8298–8304, May 2022. doi: 10.1109/ICRA46639.2022.9812019. URLhttps://ieeexplore.ieee.org/document/ 9812019

work page doi:10.1109/icra46639.2022.9812019 2022

[9] [10]

M. Yang, A. Church, Y . Lin, C. J. Ford, H. Li, E. Psomopoulou, D. A. Barton, N. F. Lepora, et al. Anyrotate: Gravity-invariant in-hand object rotation with sim-to-real touch. InConfer- ence on Robot Learning, pages 4727–4747. PMLR, 2025

2025

[10] [11]

Romero, H.-S

B. Romero, H.-S. Fang, P. Agrawal, and E. Adelson. Eyesight hand: Design of a fully-actuated dexterous robot hand with integrated vision-based tactile sensors and compliant actuation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1853–1860. IEEE, 2024. 9

2024

[11] [12]

N. Funk, E. Helmut, G. Chalvatzaki, R. Calandra, and J. Peters. Evetac: An Event-based Optical Tactile Sensor for Robotic Manipulation, Aug. 2024. URLhttp://arxiv.org/abs/ 2312.01236. arXiv:2312.01236 [cs]

arXiv 2024

[12] [13]

D. Yin, S. Lu, J. Yang, Y. Zhang, Z. Dai, D. Nan, B. Cai, S. He, and X. Chen. Gelevent—a novel high-speed tactile sensor with event camera. IEEE Transactions on Instrumentation and Measurement, 74:1–13, 2025. doi:10.1109/TIM.2025.3551440

[13] [14]

Q. Li, O. Kroemer, Z. Su, F. F. Veiga, M. Kaboli, and H. J. Ritter. A review of tactile information: Perception and action through touch. IEEE Transactions on Robotics, 36(6):1619–1634, 2020

[14] [15]

R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine. More than a feeling: Learning to grasp and regrasp using vision and touch. IEEE Robotics and Automation Letters, 3(4):3300–3307, 2018

[15] [16]

H. Qi, B. Yi, S. Suresh, M. Lambeta, Y. Ma, R. Calandra, and J. Malik. General in-hand object rotation with vision and touch. In Conference on Robot Learning, pages 2549–2564. PMLR, 2023

[16] [17]

B. Huang, Y. Wang, X. Yang, Y. Luo, and Y. Li. 3d-vitac: Learning fine-grained manipulation with visuo-tactile sensing. In Conference on Robot Learning, pages 2557–2578. PMLR, 2025

[17] [18]

H. Li, Y. Zhang, J. Zhu, S. Wang, M. A. Lee, H. Xu, E. Adelson, L. Fei-Fei, R. Gao, and J. Wu. See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation, Dec. 2022. URL http://arxiv.org/abs/2212.03858. arXiv:2212.03858 [cs]

[18] [19]

T. Ablett, O. Limoyo, A. Sigal, A. Jilani, J. Kelly, K. Siddiqi, F. Hogan, and G. Dudek. Multimodal and force-matched imitation learning with a see-through visuotactile sensor. IEEE Transactions on Robotics, 41:946–959, 2024

[19] [20]

N. Funk, C. Chen, T. Schneider, G. Chalvatzaki, R. Calandra, and J. Peters. On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting. IEEE Robotics and Automation Letters, 11(5):6218–6225, 2026

[20] [21]

H. Xue, J. Ren, W. Chen, G. Zhang, Y. Fang, G. Gu, H. Xu, and C. Lu. Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation, Apr. 2025. URL http://arxiv.org/abs/2503.02881. arXiv:2503.02881 [cs]

[21] [22]

N. Cheng, J. Xu, C. Guan, J. Gao, W. Wang, Y. Li, F. Meng, J. Zhou, B. Fang, and W. Han. Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal representation. Information Fusion, 124:103305, Dec. 2025. ISSN 1566-2535. doi:10.1016/j.inffus.2025.103305. URL https://www.sciencedirect.com/science/article/pii/S1566253525003781

[22] [23]

C. Higuera, A. Sharma, T. Fan, C. K. Bodduluri, B. Boots, M. Kaess, M. Lambeta, T. Wu, Z. Liu, F. R. Hogan, and M. Mukadam. Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation, June 2025. URL http://arxiv.org/abs/2506.14754. arXiv:2506.14754 [cs]

[23] [24]

A. George, S. Gano, P. Katragadda, and A. B. Farimani. VITaL Pretraining: Visuo-Tactile Pretraining for Tactile and Non-Tactile Manipulation Policies, Sept. 2024. URL http://arxiv.org/abs/2403.11898. arXiv:2403.11898 [cs]

[24] [25]

Z. Zhao, S. Haldar, J. Cui, L. Pinto, and R. Bhirangi. Touch begins where vision ends: Generalizable policies for contact-rich manipulation, June 2025. URL http://arxiv.org/abs/2506.13762. arXiv:2506.13762 [cs].

[25] [26]

V. Dave, F. Lygerakis, and E. Rueckert. Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training, Jan. 2024. URL http://arxiv.org/abs/2401.12024. arXiv:2401.12024 [cs]

[26] [27]

L. Fu, H. Huang, L. Berscheid, H. Li, K. Goldberg, and S. Chitta. Safe Self-Supervised Learning in Real of Visuo-Tactile Feedback Policies for Industrial Insertion, Mar. 2023. URL http://arxiv.org/abs/2210.01340. arXiv:2210.01340 [cs]

[27] [28]

C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation, May 2025. URL http://arxiv.org/abs/2505.09577. arXiv:2505.09577 [cs]

[28] [29]

G. Ye, Z. Zhang, X. Zhao, S. Wu, H. Lu, S. Lu, and H. Liu. Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation, Dec. 2025. URL http://arxiv.org/abs/2512.23864. arXiv:2512.23864 [cs]

[29] [30]

R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y. Sun, B. Fang, and D. Hu. AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors, Apr. 2025. URL http://arxiv.org/abs/2502.12191. arXiv:2502.12191 [cs]

[30] [31]

J. Zhao, Y. Ma, L. Wang, and E. H. Adelson. Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and Tasks, Oct. 2024. URL http://arxiv.org/abs/2406.13640. arXiv:2406.13640 [cs]

[31] [32]

M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. CoRR, abs/1810.10191, 2018. URL http://arxiv.org/abs/1810.10191. eprint: 1810.10191

[32] [33]

M. A. Lee, Y. Zhu, P. Zachares, M. Tan, K. Srinivasan, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks, July 2019. URL http://arxiv.org/abs/1907.13098. arXiv:1907.13098 [cs]

[33] [34]

R. Feng, D. Hu, W. Ma, and X. Li. Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation, Oct. 2024. URL http://arxiv.org/abs/2408.01366. arXiv:2408.01366 [cs]

[34] [35]

C. Sferrazza, Y. Seo, H. Liu, Y. Lee, and P. Abbeel. The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning, Nov. 2023. URL http://arxiv.org/abs/2311.00924. arXiv:2311.00924 [cs]

[35] [36]

M. Lambeta, T. Wu, A. Sengul, V. R. Most, N. Black, K. Sawyer, R. Mercado, H. Qi, A. Sohn, B. Taylor, N. Tydingco, G. Kammerer, D. Stroud, J. Khatha, K. Jenkins, K. Most, N. Stein, R. Chavira, T. Craven-Bartle, E. Sanchez, Y. Ding, J. Malik, and R. Calandra. Digitizing touch with an artificial multimodal fingertip, 2024. URL https://arxiv.org/abs/2411.02479

[36] [37]

J. Zhao, N. Kuppuswamy, S. Feng, B. Burchfiel, and E. Adelson. Polytouch: A robust multimodal tactile sensor for contact-rich manipulation using tactile-diffusion policies. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 104–110. IEEE, 2025

[37] [38]

P. Jin, B. Huang, W. W. Lee, T. Li, and W. Yang. Visual-Force-Tactile Fusion for Gentle Intricate Insertion Tasks. IEEE Robotics and Automation Letters, 9(5):4830–4837, May 2024. ISSN 2377-3766. doi:10.1109/LRA.2024.3379803. URL https://ieeexplore.ieee.org/document/10476678/.

[39] [40]

K. Yu, Y. Han, Q. Wang, V. Saxena, D. Xu, and Y. Zhao. MimicTouch: Leveraging Multimodal Human Tactile Demonstrations for Contact-rich Manipulation, Feb. 2025. URL http://arxiv.org/abs/2310.16917. arXiv:2310.16917 [cs]

[40] [41]

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control, 2026. URL https://arxiv. o...

[41] [42]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling, 2023. URL https://arxiv.org/abs/2210.02747

[42] [43]

T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick. Early convolutions help transformers see better, 2021. URL https://arxiv.org/abs/2106.14881

[43] [44]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762

[44] [45]

J. Urain, A. Mandlekar, Y. Du, N. Muhammad “Mahi” Shafiullah, D. Xu, K. Fragkiadaki, G. Chalvatzaki, and J. Peters. A survey on deep generative models for robot learning from multimodal demonstrations. IEEE Transactions on Robotics, 42:60–79, 2026. doi:10.1109/TRO.2025.3631816

[45] [46]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

[46] [47]

W. Peebles and S. Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

[47] [48]

S. Dasari, O. Mees, S. Zhao, M. K. Srirama, and S. Levine. The ingredients for robotic diffusion transformers, 2024. URL https://arxiv.org/abs/2410.10088

[48] [49]

R. Krohn, V. Prasad, G. Tiboni, and G. Chalvatzaki. Self-supervised multisensory pretraining for contact-rich robot reinforcement learning. IEEE Robotics and Automation Letters, 11(6):6799–6806, 2026. doi:10.1109/LRA.2026.3681156

[49] [50]

A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun. Attention bottlenecks for multimodal fusion, 2022. URL https://arxiv.org/abs/2107.00135

[50] [51]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021. URL https://arxiv.org/abs/2010.11929

[51] [52]

B. Tang, M. A. Lin, I. Akinola, A. Handa, G. S. Sukhatme, F. Ramos, D. Fox, and Y. Narang. Industreal: Transferring contact-rich assembly tasks from simulation to reality, 2023. URL https://arxiv.org/abs/2305.17110

[52] [53]

M. Heo, Y. Lee, D. Lee, and J. J. Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Robotics: Science and Systems, 2023

[53] [54]

M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains, 2020. URL https://arxiv.org/abs/2006.10739.

A Appendix

This appendix provides supplementary material that supports the experimental e...