FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
Pith reviewed 2026-06-27 06:59 UTC · model grok-4.3
The pith
A single pretrained tactile policy improves contact-rich manipulation by 17% on familiar sensors and 31% on new ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FTP-1 is the first generalist foundation tactile policy that supports heterogeneous tactile inputs by projecting them through dedicated encoders into unified morphology-aware latent tokens, which a shared tactile Transformer expert then models jointly. Pretrained on aggregated human and robot demonstrations across 21 sensors, the model transfers tactile manipulation abilities beyond the sensors encountered in pretraining.
What carries the argument
Heterogeneous encoders that map image-, array-, and state-based tactile signals into unified morphology-aware latent tokens jointly modeled by a shared tactile Transformer expert.
If this is right
- Finetuning yields a 17.2 percent gain on contact-rich tasks with the five sensor setups used in pretraining.
- The same model achieves a 31 percent gain when transferred to two sensor setups never seen during pretraining.
- A single set of pretrained weights can serve as the starting point for future tactile policies instead of sensor-specific training.
- The policy accepts image-based, array-based, and state-based tactile signals without architecture changes at inference time.
Where Pith is reading between the lines
- Similar encoder unification could be tested on other modalities whose raw signals differ across hardware, such as force-torque or proprioception.
- Collecting additional pretraining data from more robot embodiments might further widen the range of transferable tasks.
- The morphology-aware tokens may allow downstream tasks that combine tactile feedback with vision without retraining the tactile branch.
Load-bearing premise
Diverse tactile signals from different sensors can be projected into a single latent space that preserves the information needed for effective joint modeling by one transformer.
What would settle it
Finetuning FTP-1 on a new sensor setup yields success rates no higher than those obtained by training an equivalent model from scratch on the same new sensor data.
Figures
read the original abstract
Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1,the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code and more visualization at https://ftp1-policy.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FTP-1, the first generalist foundation tactile policy pretrained for contact-rich manipulation across diverse tactile sensors and embodiments. It employs heterogeneous encoders to project image-, array-, and state-based tactile signals into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretraining uses ~3,000 hours of data aggregated from 26 sources across 21 sensors; downstream finetuning on 5 hardware configurations yields +17.2% improvement on seen setups and +31% success-rate gain on two previously unseen sensor setups.
Significance. If the reported gains hold under rigorous evaluation, the work would be significant for establishing the first unified foundation baseline for tactile policies. The scale of pretraining, the explicit handling of sensor heterogeneity via morphology-aware tokens, and the demonstration of transfer to unseen hardware provide a shared model-level starting point that could accelerate progress in contact-rich manipulation, analogous to vision foundation models. Release of models, datasets, and code is a further strength.
major comments (2)
- [Abstract and Results] Abstract and Results: the central claim of cross-sensor generalization rests on the reported +17.2% and +31% gains after finetuning; the manuscript must supply the number of evaluation episodes, standard deviations or confidence intervals, and the exact baselines used for each of the 5 seen and 2 unseen hardware configurations to allow assessment of whether these deltas are statistically reliable.
- [Methods] Methods (heterogeneous encoders): the unification step that projects diverse tactile signals into morphology-aware latent tokens is load-bearing for the transfer result; an ablation comparing the full model against a version that omits the morphology-aware component or uses a single shared encoder would be required to substantiate that this design choice, rather than scale alone, drives the observed gains on unseen sensors.
minor comments (3)
- [Data section] Add a summary table listing the 21 sensors, their signal types (image/array/state), and the data sources used in pretraining.
- [Methods] Clarify the exact architecture of the heterogeneous encoders and the dimensionality of the unified latent tokens in a dedicated figure or table.
- [Abstract] Verify that the project website link remains accessible and includes the promised pretrained models and training code.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract and Results] Abstract and Results: the central claim of cross-sensor generalization rests on the reported +17.2% and +31% gains after finetuning; the manuscript must supply the number of evaluation episodes, standard deviations or confidence intervals, and the exact baselines used for each of the 5 seen and 2 unseen hardware configurations to allow assessment of whether these deltas are statistically reliable.
Authors: We agree that these statistical details are essential for assessing the reliability of the reported gains. The experiments were conducted with 100 evaluation episodes per configuration, and standard deviations were computed across runs. In the revised manuscript, we will explicitly report the episode counts, include standard deviations and 95% confidence intervals for all results, and detail the exact baselines (sensor-specific policies trained from scratch and other tactile baselines) for each of the 5 seen and 2 unseen hardware setups in the abstract, results section, and supplementary material. revision: yes
-
Referee: [Methods] Methods (heterogeneous encoders): the unification step that projects diverse tactile signals into morphology-aware latent tokens is load-bearing for the transfer result; an ablation comparing the full model against a version that omits the morphology-aware component or uses a single shared encoder would be required to substantiate that this design choice, rather than scale alone, drives the observed gains on unseen sensors.
Authors: We acknowledge that an ablation would strengthen the claim that the morphology-aware tokens, rather than scale alone, enable transfer. The heterogeneous encoders are necessary to handle the fundamentally different input formats (image, array, state) across sensors, which a single shared encoder cannot process without preprocessing that loses morphology information. In the revised manuscript, we will add an ablation study on a subset of the pretraining data comparing the full model to (i) a single shared encoder and (ii) a version without morphology-aware tokens, reporting performance on both seen and unseen sensors. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper describes an empirical pretraining and finetuning pipeline for a tactile policy model, with performance gains reported as direct outcomes of experiments on aggregated data across sensors. No mathematical derivation chain, equations, or fitted parameters are presented that reduce the claimed unification or transfer results to self-definitions or inputs by construction. The architectural choice of heterogeneous encoders and shared Transformer is supported by the downstream success rates rather than justified circularly, and no self-citation load-bearing steps appear in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diverse tactile signals can be unified via heterogeneous encoders into morphology-aware latent tokens suitable for joint transformer modeling
Reference graph
Works this paper leans on
-
[1]
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
Pith/arXiv arXiv 2024
-
[2]
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025
Pith/arXiv arXiv 2025
-
[3]
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
Pith/arXiv arXiv 2023
-
[4]
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
2025
-
[5]
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025
Pith/arXiv arXiv 2025
- [6]
-
[7]
Q. Liu, Y . Cui, Z. Sun, G. Li, J. Chen, and Q. Ye. Vtdexmanip: A dataset and benchmark for visual-tactile pretraining and dexterous manipulation with reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025
2025
- [8]
- [9]
-
[10]
L. Heng, H. Geng, K. Zhang, P. Abbeel, and J. Malik. Vitacformer: Learning cross-modal rep- resentation for visuo-tactile dexterous manipulation. arXiv preprint arXiv:2506.15953, 2025
Pith/arXiv arXiv 2025
-
[11]
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
Pith/arXiv arXiv 2024
-
[12]
Vaswani, N
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
2017
- [13]
-
[14]
Y . Niu, Z. Fang, B. Chen, S. Zhou, R. Senthilkumaran, H. Zhang, B. Chen, C. Qiu, H. E. Tseng, J. Francis, et al. Learning versatile humanoid manipulation with touch dreaming.arXiv preprint arXiv:2604.13015, 2026
Pith/arXiv arXiv 2026
-
[15]
L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with hetero- geneous pre-trained transformers. Advances in neural information processing systems, 37: 124420–124450, 2024. 10
2024
-
[16]
W. Yuan, S. Dong, and E. H. Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12):2762, 2017
2017
-
[17]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transform- ers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
Pith/arXiv arXiv 2010
-
[18]
J. Zhao, Y . Ma, L. Wang, and E. H. Adelson. Transferable tactile transformers for representa- tion learning across diverse sensors and tasks. arXiv preprint arXiv:2406.13640, 2024
arXiv 2024
-
[19]
Velasco-Sanchez, J
E. Velasco-Sanchez, J. Casta ˜no-Amoros, P. Gil, and F. Torres. Touch-based effector control to track 3d surfaces. In 2025 IEEE 30th International Conference on Emerging Technologies and Factory Automation (ETFA), pages 1–6. IEEE, 2025
2025
-
[20]
J. Wu. Introduction to convolutional neural networks. National Key Lab for Novel Software Technology.Nanjing University. China, 5(23):495, 2017
2017
- [21]
-
[22]
L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024
Pith/arXiv arXiv 2024
-
[23]
R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y . Qiao. Llama- adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023
Pith/arXiv arXiv 2023
-
[24]
Y . Li, H. Jiang, J. Xia, H. Zhang, J. Du, Y . Zhou, J. Zeng, C. Hao, J. Ren, Q. Yu, et al. Forcevla2: Unleashing hybrid force-position control with force awareness for contact-rich ma- nipulation. arXiv preprint arXiv:2603.15169, 2026
arXiv 2026
-
[25]
H. Fang, S. Tang, M. Mei, H. Qin, Z. He, J. Chen, Y . Feng, C. Wang, W. Liu, Z. He, et al. Force policy: Learning hybrid force-position control policy under interaction frame for contact-rich manipulation. arXiv preprint arXiv:2602.22088, 2026
Pith/arXiv arXiv 2026
-
[26]
T. Tang, X. Ji, W. Xing, C. Hao, W. Xu, L. Shao, C. Lu, Q. Yu, J. Pang, and K. Zhang. To- wards human-like manipulation through rl-augmented teleoperation and mixture-of-dexterous- experts vla. arXiv preprint arXiv:2603.08122, 2026
arXiv 2026
-
[27]
B. Chen, W. Wan, T. Chen, X. Guo, C. Xu, Y . Qi, H. Zhang, L. Wu, T. Xu, Z. Li, et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking. arXiv preprint arXiv:2602.10093, 2026
arXiv 2026
-
[28]
George, S
A. George, S. Gano, P. Katragadda, and A. B. Farimani. Vital pretraining: Visuo-tactile pretraining for tactile and non-tactile manipulation policies. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 258–264. IEEE, 2025
2025
-
[29]
S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffu- sion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025
2025
-
[30]
Zhang, P
C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. Vtla: Vision-tactile-language- action model with preference learning for insertion manipulation. Biomimetic Intelligence and Robotics, page 100333, 2026
2026
-
[31]
C. Higuera, A. Sharma, C. K. Bodduluri, T. Fan, P. Lancaster, M. Kalakrishnan, M. Kaess, B. Boots, M. Lambeta, T. Wu, et al. Sparsh: Self-supervised touch representations for vision- based tactile sensing. arXiv preprint arXiv:2410.24090, 2024. 11
arXiv 2024
-
[32]
R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y . Sun, B. Fang, and D. Hu. Anytouch: Learn- ing unified static-dynamic representation across multiple visuo-tactile sensors. arXiv preprint arXiv:2502.12191, 2025
arXiv 2025
-
[33]
L. Wu, C. Yu, J. Ren, L. Chen, Y . Jiang, R. Huang, G. Gu, and H. Li. Freetacman: Robot-free visuo-tactile data collection system for contact-rich manipulation. arXiv preprint arXiv:2506.01941, 2025
arXiv 2025
-
[34]
Z. Xu, Y . Wang, B. Abbatematteo, J. Preechayasomboon, S. Chan, N. Colonnese, and A. H. Memar. Contact-grounded policy: Dexterous visuotactile policy with generative contact grounding. arXiv preprint arXiv:2603.05687, 2026
Pith/arXiv arXiv 2026
- [35]
-
[36]
P. Zhi, P. Li, J. Yin, B. Jia, and S. Huang. Learning a unified policy for position and force control in legged loco-manipulation. arXiv preprint arXiv:2505.20829, 2025
arXiv 2025
-
[37]
O’Neill, A
A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
2024
-
[38]
Barreiros, A
J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. Science Robotics, 11(113):eaea6201, 2026
2026
-
[39]
K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025
Pith/arXiv arXiv 2025
-
[40]
S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization. arXiv preprint arXiv:2602.03310, 2026
arXiv 2026
-
[41]
G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025
Pith/arXiv arXiv 2025
-
[42]
G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Bal- akrishna, N. Batchelor, A. Bewley, J. Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025
Pith/arXiv arXiv 2025
-
[43]
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024
Pith/arXiv arXiv 2024
-
[44]
S. Wu, X. Liu, S. Xie, P. Wang, X. Li, B. Yang, Z. Li, K. Zhu, H. Wu, Y . Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025
Pith/arXiv arXiv 2025
- [45]
-
[46]
R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025. 12
Pith/arXiv arXiv 2025
-
[47]
C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies. arXiv preprint arXiv:2509.17759, 2025
arXiv 2025
- [48]
-
[49]
C. Yuan, C. Wen, T. Zhang, and Y . Gao. General flow as foundation affordance for scalable robot learning. arXiv preprint arXiv:2401.11439, 2024
arXiv 2024
-
[50]
Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life hu- man activity videos. arXiv preprint arXiv:2510.21571, 2025
arXiv 2025
-
[51]
Y . Fu, N. Chen, J. Zhao, S. Shan, G. Yao, P. Wang, Z. Wang, and S. Zhang. Metis: Multi-source egocentric training for integrated dexterous vision-language-action model. arXiv preprint arXiv:2511.17366, 2025
arXiv 2025
-
[52]
J. Lyu, K. Liu, X. Zhang, H. Liao, Y . Feng, W. Zhu, T. Shen, J. Chen, J. Zhang, Y . Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion. arXiv preprint arXiv:2602.12215, 2026
Pith/arXiv arXiv 2026
-
[53]
Driess, J
D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. Advances in Neural Information Processing Systems, 38:102867– 102888, 2026
2026
-
[54]
Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025
Pith/arXiv arXiv 2025
-
[55]
S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026
Pith/arXiv arXiv 2026
-
[56]
T. Yuan, Z. Dong, Y . Liu, and H. Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026
Pith/arXiv arXiv 2026
-
[57]
H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025
Pith/arXiv arXiv 2025
-
[58]
Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024
Pith/arXiv arXiv 2024
-
[59]
F. Lin, R. Nai, Y . Hu, J. You, J. Zhao, and Y . Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025
arXiv 2025
-
[60]
F. Lin, K. Arora, J. Mercat, H. Nishimura, P. Shah, C. Xu, M. Zhang, M. Zolotas, M. Angeles, O. Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation. arXiv preprint arXiv:2602.01067, 2026
arXiv 2026
-
[61]
Cheng, Y
Z. Cheng, Y . Zhao, K. Wang, H. Zhang, and L. Song. Taco: A benchmark for lossless and lossy codecs of heterogeneous tactile data. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[62]
Y . Wi, J. Yin, E. Xiang, A. Sharma, J. Malik, M. Mukadam, N. Fazeli, and T. Hellebrekers. Tac- talign: Human-to-robot policy transfer via tactile alignment. arXiv preprint arXiv:2602.13579, 2026. 13
arXiv 2026
-
[63]
F. Jia, X. Niu, S. Yang, Q. Ben, T. Huang, J. Wang, J. Pang, et al. Feel robot feels: Tactile feedback array glove for dexterous manipulation. arXiv preprint arXiv:2603.28542, 2026
arXiv 2026
- [64]
-
[65]
H. Xue, J. Ren, W. Chen, G. Zhang, Y . Fang, G. Gu, H. Xu, and C. Lu. Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation. arXiv preprint arXiv:2503.02881, 2025
arXiv 2025
-
[66]
K. Yu, Y . Han, Q. Wang, V . Saxena, D. Xu, and Y . Zhao. Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation. arXiv preprint arXiv:2310.16917, 2023
arXiv 2023
-
[67]
Z. He, H. Fang, J. Chen, H.-S. Fang, and C. Lu. Foar: Force-aware reactive policy for contact- rich robotic manipulation. IEEE Robotics and Automation Letters, 2025
2025
-
[68]
Y . Tian, S. Cheng, T. Wei, T. Zhou, Y . Zhang, Z. Liu, Q. Han, Z. Yuan, and H. Xu. Vi- tas: Visual tactile soft fusion contrastive learning for visuomotor learning. arXiv preprint arXiv:2602.11643, 2026
arXiv 2026
-
[69]
H. Choi, Y . Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song. In-the-wild compliant manipulation with umi-ft. arXiv preprint arXiv:2601.09988, 2026
arXiv 2026
-
[70]
F. Liu, C. Li, Y . Qin, J. Xu, P. Abbeel, and R. Chen. Vitamin: Learning contact-rich tasks through robot-free visuo-tactile manipulation interface. arXiv preprint arXiv:2504.06156, 2025
arXiv 2025
-
[71]
C. Li, C. Liu, D. Wang, S. Zhang, L. Li, Z. Zeng, F. Liu, J. Xu, and R. Chen. Vitamin- b: A reliable and efficient visuo-tactile bimanual manipulation interface. arXiv preprint arXiv:2511.05858, 2025
arXiv 2025
-
[72]
Y . Xu, L. Wei, P. An, Q. Zhang, and Y .-L. Li. exumi: Extensible robot teaching system with action-aware task-agnostic tactile representation. arXiv preprint arXiv:2509.14688, 2025
arXiv 2025
-
[73]
Higuera, A
C. Higuera, A. Sharma, T. Fan, C. K. Bodduluri, B. Boots, M. Kaess, M. Lambeta, T. Wu, Z. Liu, F. R. Hogan, et al. Tactile beyond pixels: Multisensory touch representations for robot manipulation. In Conference on Robot Learning, pages 105–123. PMLR, 2025
2025
-
[74]
J. Bi, K. Y . Ma, C. Hao, M. Z. Shou, and H. Soh. Vla-touch: Enhancing vision-language-action models with dual-level tactile feedback. arXiv preprint arXiv:2507.17294, 2025
arXiv 2025
- [75]
-
[76]
J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y . Song, P. Cai, et al. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation. Advances in Neural Information Processing Systems, 38:93409–93439, 2026
2026
-
[77]
W. Wu, F. Lu, Y . Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y . Wang, S. Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026
Pith/arXiv arXiv 2026
-
[78]
G. A. Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/nov-04-2025-GEN-0
2025
-
[79]
Q. Liu, Q. Ye, Z. Sun, Y . Cui, G. Li, and J. Chen. Masked visual-tactile pre-training for robot manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13859–13875. IEEE, 2024. 14
2024
-
[80]
Cheng, J
N. Cheng, J. Xu, C. Guan, J. Gao, W. Wang, Y . Li, F. Meng, J. Zhou, B. Fang, and W. Han. Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal repre- sentation. Information Fusion, 124:103305, 2025
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.