pith. sign in

arxiv: 2606.13102 · v2 · pith:OY2EQQEDnew · submitted 2026-06-11 · 💻 cs.RO

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

Pith reviewed 2026-06-27 06:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords tactile policyfoundation modelcontact-rich manipulationcross-sensor generalizationroboticstransformerpretrainingmultimodal sensing
0
0 comments X

The pith

A single pretrained tactile policy improves contact-rich manipulation by 17% on familiar sensors and 31% on new ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FTP-1 pretrains a tactile policy on roughly 3000 hours of data drawn from 26 sources and 21 different sensors to learn skills that transfer across hardware. It converts image-based, array-based, and state-based tactile signals into a shared set of morphology-aware latent tokens via separate encoders, then processes those tokens with one tactile Transformer. The resulting model is fine-tuned on five hardware setups and raises success rates on both previously seen and entirely new sensor configurations. If the approach holds, robots could start from a common tactile foundation instead of building separate policies for each sensor type.

Core claim

FTP-1 is the first generalist foundation tactile policy that supports heterogeneous tactile inputs by projecting them through dedicated encoders into unified morphology-aware latent tokens, which a shared tactile Transformer expert then models jointly. Pretrained on aggregated human and robot demonstrations across 21 sensors, the model transfers tactile manipulation abilities beyond the sensors encountered in pretraining.

What carries the argument

Heterogeneous encoders that map image-, array-, and state-based tactile signals into unified morphology-aware latent tokens jointly modeled by a shared tactile Transformer expert.

If this is right

  • Finetuning yields a 17.2 percent gain on contact-rich tasks with the five sensor setups used in pretraining.
  • The same model achieves a 31 percent gain when transferred to two sensor setups never seen during pretraining.
  • A single set of pretrained weights can serve as the starting point for future tactile policies instead of sensor-specific training.
  • The policy accepts image-based, array-based, and state-based tactile signals without architecture changes at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar encoder unification could be tested on other modalities whose raw signals differ across hardware, such as force-torque or proprioception.
  • Collecting additional pretraining data from more robot embodiments might further widen the range of transferable tasks.
  • The morphology-aware tokens may allow downstream tasks that combine tactile feedback with vision without retraining the tactile branch.

Load-bearing premise

Diverse tactile signals from different sensors can be projected into a single latent space that preserves the information needed for effective joint modeling by one transformer.

What would settle it

Finetuning FTP-1 on a new sensor setup yields success rates no higher than those obtained by training an equivalent model from scratch on the same new sensor data.

Figures

Figures reproduced from arXiv: 2606.13102 by Cewu Lu, Chengbo Yuan, Chuan Wen, Dantong Niu, Hui Zhang, Kaifeng Zhang, Mingjie Zhou, Shuo Wang, Wanli Xing, Wendi Chen, Wenkang Zhang, Yang Gao, Yingdong Hu, Yi Wang, Yuanqing Gong, Zhuoyang Liu, Zicheng Zhang.

Figure 1
Figure 1. Figure 1: We present FTP-1, the first generalist foundation tactile policy pretrained for diverse [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of FTP-1 architecture. Heterogeneous tactile observations are mapped into a [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: For in-hand tactile signals, we use slots 0-14 to represent different hand functional re￾gions. For force/torque signals from wrists and fingers, we use slots 15-20. For parallel grip￾pers, the two gripper-side sensors are mapped to the thumb-tip slot (slot 0) and index-fingertip slot (slot 1), reflecting their two-finger grasping function. Slots 21-23 are reserved for future use [PITH_FULL_IMAGE:figures/… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the FTP-1-Dataset. The dataset aggregates 26 sources across human and [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the finetuning experiment for seen sensors during pretraining. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overview of the finetuning experiment for unseen sensors during pretraining. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison between FTP-1 and NTP-1 on UniVTAC and FlexivXense. Results are shown in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Joint mappings of different robotic hands used in FAAS. From left to right are Ability, [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: FAAS definition for the human hand. For q b hand, following UniDex [13], we use a Function–Actuator–Aligned Space (FAAS) to unify the control action spaces of different end￾effectors. Brief Review of FAAS [13]. For the hand joint action space qhand, we follow UniDex [13] and adopt the Function–Actuator–Aligned Space (FAAS), which unifies the action spaces of dif￾ferent end-effectors by assigning functional… view at source ↗
Figure 10
Figure 10. Figure 10: Similar sensors between the pretraining dataset and the downstream evaluation setups. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Representative rollouts on the UniVTAC simulation benchmark. Each row corresponds [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Representative rollouts on the Sharpa North setup, including Draw Balloon, Fix Hand [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Representative rollouts on the Sharpa&Dexmate setup, including Flip Book and Wipe [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Representative rollouts on the FlexivXense setup, including Insert Hanoi and Insert USB. [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
read the original abstract

Despite the success of vision-based generalist robotic policies, existing tactile-based policies remain tied to fixed embodiments and sensor setups. This is because tactile signals are highly heterogeneous across hardware, making cross-sensor generalization difficult. We present FTP-1,the first generalist foundation tactile policy pretrained to acquire transferable tactile manipulation abilities across diverse sensors and embodiments. FTP-1 supports varied tactile inputs, including image-, array-, and state-based signals, by using heterogeneous encoders to project them into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretrained on around 3,000 hours of tactile manipulation data aggregated from 26 data sources, spanning human and robot demonstrations across 21 sensors, FTP-1 learns tactile skills that transfer beyond the sensors seen during pretraining. Across downstream finetuning experiments spanning 5 hardware configurations, FTP-1 improves contact-rich manipulation on seen sensor setups by +17.2% and, surprisingly, transfers to two previously unseen tactile-sensor setups, achieving a +31% gain in success rate. FTP-1 establishes the first unified foundation baseline for tactile manipulation, providing future tactile policies with a shared model-level starting point. Pretrained models, datasets, training code and more visualization at https://ftp1-policy.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents FTP-1, the first generalist foundation tactile policy pretrained for contact-rich manipulation across diverse tactile sensors and embodiments. It employs heterogeneous encoders to project image-, array-, and state-based tactile signals into unified morphology-aware latent tokens that are jointly modeled by a shared tactile Transformer expert. Pretraining uses ~3,000 hours of data aggregated from 26 sources across 21 sensors; downstream finetuning on 5 hardware configurations yields +17.2% improvement on seen setups and +31% success-rate gain on two previously unseen sensor setups.

Significance. If the reported gains hold under rigorous evaluation, the work would be significant for establishing the first unified foundation baseline for tactile policies. The scale of pretraining, the explicit handling of sensor heterogeneity via morphology-aware tokens, and the demonstration of transfer to unseen hardware provide a shared model-level starting point that could accelerate progress in contact-rich manipulation, analogous to vision foundation models. Release of models, datasets, and code is a further strength.

major comments (2)
  1. [Abstract and Results] Abstract and Results: the central claim of cross-sensor generalization rests on the reported +17.2% and +31% gains after finetuning; the manuscript must supply the number of evaluation episodes, standard deviations or confidence intervals, and the exact baselines used for each of the 5 seen and 2 unseen hardware configurations to allow assessment of whether these deltas are statistically reliable.
  2. [Methods] Methods (heterogeneous encoders): the unification step that projects diverse tactile signals into morphology-aware latent tokens is load-bearing for the transfer result; an ablation comparing the full model against a version that omits the morphology-aware component or uses a single shared encoder would be required to substantiate that this design choice, rather than scale alone, drives the observed gains on unseen sensors.
minor comments (3)
  1. [Data section] Add a summary table listing the 21 sensors, their signal types (image/array/state), and the data sources used in pretraining.
  2. [Methods] Clarify the exact architecture of the heterogeneous encoders and the dimensionality of the unified latent tokens in a dedicated figure or table.
  3. [Abstract] Verify that the project website link remains accessible and includes the promised pretrained models and training code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results: the central claim of cross-sensor generalization rests on the reported +17.2% and +31% gains after finetuning; the manuscript must supply the number of evaluation episodes, standard deviations or confidence intervals, and the exact baselines used for each of the 5 seen and 2 unseen hardware configurations to allow assessment of whether these deltas are statistically reliable.

    Authors: We agree that these statistical details are essential for assessing the reliability of the reported gains. The experiments were conducted with 100 evaluation episodes per configuration, and standard deviations were computed across runs. In the revised manuscript, we will explicitly report the episode counts, include standard deviations and 95% confidence intervals for all results, and detail the exact baselines (sensor-specific policies trained from scratch and other tactile baselines) for each of the 5 seen and 2 unseen hardware setups in the abstract, results section, and supplementary material. revision: yes

  2. Referee: [Methods] Methods (heterogeneous encoders): the unification step that projects diverse tactile signals into morphology-aware latent tokens is load-bearing for the transfer result; an ablation comparing the full model against a version that omits the morphology-aware component or uses a single shared encoder would be required to substantiate that this design choice, rather than scale alone, drives the observed gains on unseen sensors.

    Authors: We acknowledge that an ablation would strengthen the claim that the morphology-aware tokens, rather than scale alone, enable transfer. The heterogeneous encoders are necessary to handle the fundamentally different input formats (image, array, state) across sensors, which a single shared encoder cannot process without preprocessing that loses morphology information. In the revised manuscript, we will add an ablation study on a subset of the pretraining data comparing the full model to (i) a single shared encoder and (ii) a version without morphology-aware tokens, reporting performance on both seen and unseen sensors. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes an empirical pretraining and finetuning pipeline for a tactile policy model, with performance gains reported as direct outcomes of experiments on aggregated data across sensors. No mathematical derivation chain, equations, or fitted parameters are presented that reduce the claimed unification or transfer results to self-definitions or inputs by construction. The architectural choice of heterogeneous encoders and shared Transformer is supported by the downstream success rates rather than justified circularly, and no self-citation load-bearing steps appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be extracted beyond the high-level modeling choice stated in the text.

axioms (1)
  • domain assumption Diverse tactile signals can be unified via heterogeneous encoders into morphology-aware latent tokens suitable for joint transformer modeling
    This premise is required for the cross-sensor pretraining approach described in the abstract.

pith-pipeline@v0.9.1-grok · 5820 in / 1171 out tokens · 25162 ms · 2026-06-27T06:59:53.822917+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

95 extracted references · 32 linked inside Pith

  1. [1]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  2. [2]

    Bjorck, F

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  3. [3]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  4. [4]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  5. [5]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  6. [6]

    Cheng, Y

    Z. Cheng, Y . Zhang, W. Zhang, H. Li, K. Wang, L. Song, and H. Zhang. Omnivtla: Vision-tactile-language-action model with semantic-aligned tactile sensing. arXiv preprint arXiv:2508.08706, 2025

  7. [7]

    Q. Liu, Y . Cui, Z. Sun, G. Li, J. Chen, and Q. Ye. Vtdexmanip: A dataset and benchmark for visual-tactile pretraining and dexterous manipulation with reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025

  8. [8]

    Zheng, S

    Y . Zheng, S. Gu, W. Li, Y . Zheng, Y . Zang, S. Tian, X. Li, C. Hao, C. Gao, S. Liu, et al. Omnivta: Visuo-tactile world modeling for contact-rich robotic manipulation. arXiv preprint arXiv:2603.19201, 2026

  9. [9]

    Huang, S

    J. Huang, S. Wang, F. Lin, Y . Hu, C. Wen, and Y . Gao. Tactile-vla: unlocking vision-language- action model’s physical knowledge for tactile generalization.arXiv preprint arXiv:2507.09160, 2025

  10. [10]

    L. Heng, H. Geng, K. Zhang, P. Abbeel, and J. Malik. Vitacformer: Learning cross-modal rep- resentation for visuo-tactile dexterous manipulation. arXiv preprint arXiv:2506.15953, 2025

  11. [11]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  12. [12]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  13. [13]

    Zhang, Q

    G. Zhang, Q. Xu, H. Zhang, J. Ma, L. He, Y . Bao, Z. Ping, Z. Yuan, C. Lu, C. Yuan, et al. Unidex: A robot foundation suite for universal dexterous hand control from egocentric human videos. arXiv preprint arXiv:2603.22264, 2026

  14. [14]

    Y . Niu, Z. Fang, B. Chen, S. Zhou, R. Senthilkumaran, H. Zhang, B. Chen, C. Qiu, H. E. Tseng, J. Francis, et al. Learning versatile humanoid manipulation with touch dreaming.arXiv preprint arXiv:2604.13015, 2026

  15. [15]

    L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with hetero- geneous pre-trained transformers. Advances in neural information processing systems, 37: 124420–124450, 2024. 10

  16. [16]

    W. Yuan, S. Dong, and E. H. Adelson. Gelsight: High-resolution robot tactile sensors for estimating geometry and force. Sensors, 17(12):2762, 2017

  17. [17]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transform- ers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  18. [18]

    J. Zhao, Y . Ma, L. Wang, and E. H. Adelson. Transferable tactile transformers for representa- tion learning across diverse sensors and tasks. arXiv preprint arXiv:2406.13640, 2024

  19. [19]

    Velasco-Sanchez, J

    E. Velasco-Sanchez, J. Casta ˜no-Amoros, P. Gil, and F. Torres. Touch-based effector control to track 3d surfaces. In 2025 IEEE 30th International Conference on Emerging Technologies and Factory Automation (ETFA), pages 1–6. IEEE, 2025

  20. [20]

    J. Wu. Introduction to convolutional neural networks. National Key Lab for Novel Software Technology.Nanjing University. China, 5(23):495, 2017

  21. [21]

    Huang, Y

    J. Huang, Y . Ye, Y . Gong, X. Zhu, Y . Gao, and K. Zhang. Spatially anchored tactile awareness for robust dexterous manipulation. arXiv preprint arXiv:2510.14647, 2025

  22. [22]

    Beyer, A

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  23. [23]

    Zhang, J

    R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y . Qiao. Llama- adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

  24. [24]

    Y . Li, H. Jiang, J. Xia, H. Zhang, J. Du, Y . Zhou, J. Zeng, C. Hao, J. Ren, Q. Yu, et al. Forcevla2: Unleashing hybrid force-position control with force awareness for contact-rich ma- nipulation. arXiv preprint arXiv:2603.15169, 2026

  25. [25]

    H. Fang, S. Tang, M. Mei, H. Qin, Z. He, J. Chen, Y . Feng, C. Wang, W. Liu, Z. He, et al. Force policy: Learning hybrid force-position control policy under interaction frame for contact-rich manipulation. arXiv preprint arXiv:2602.22088, 2026

  26. [26]

    T. Tang, X. Ji, W. Xing, C. Hao, W. Xu, L. Shao, C. Lu, Q. Yu, J. Pang, and K. Zhang. To- wards human-like manipulation through rl-augmented teleoperation and mixture-of-dexterous- experts vla. arXiv preprint arXiv:2603.08122, 2026

  27. [27]

    B. Chen, W. Wan, T. Chen, X. Guo, C. Xu, Y . Qi, H. Zhang, L. Wu, T. Xu, Z. Li, et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking. arXiv preprint arXiv:2602.10093, 2026

  28. [28]

    George, S

    A. George, S. Gano, P. Katragadda, and A. B. Farimani. Vital pretraining: Visuo-tactile pretraining for tactile and non-tactile manipulation policies. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 258–264. IEEE, 2025

  29. [29]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffu- sion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

  30. [30]

    Zhang, P

    C. Zhang, P. Hao, X. Cao, X. Hao, S. Cui, and S. Wang. Vtla: Vision-tactile-language- action model with preference learning for insertion manipulation. Biomimetic Intelligence and Robotics, page 100333, 2026

  31. [31]

    Higuera, A

    C. Higuera, A. Sharma, C. K. Bodduluri, T. Fan, P. Lancaster, M. Kalakrishnan, M. Kaess, B. Boots, M. Lambeta, T. Wu, et al. Sparsh: Self-supervised touch representations for vision- based tactile sensing. arXiv preprint arXiv:2410.24090, 2024. 11

  32. [32]

    R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y . Sun, B. Fang, and D. Hu. Anytouch: Learn- ing unified static-dynamic representation across multiple visuo-tactile sensors. arXiv preprint arXiv:2502.12191, 2025

  33. [33]

    L. Wu, C. Yu, J. Ren, L. Chen, Y . Jiang, R. Huang, G. Gu, and H. Li. Freetacman: Robot-free visuo-tactile data collection system for contact-rich manipulation. arXiv preprint arXiv:2506.01941, 2025

  34. [34]

    Z. Xu, Y . Wang, B. Abbatematteo, J. Preechayasomboon, S. Chan, N. Colonnese, and A. H. Memar. Contact-grounded policy: Dexterous visuotactile policy with generative contact grounding. arXiv preprint arXiv:2603.05687, 2026

  35. [35]

    Zhang, C

    D. Zhang, C. Yuan, C. Wen, H. Zhang, J. Zhao, and Y . Gao. Kinedex: Learning tactile- informed visuomotor policies via kinesthetic teaching for dexterous manipulation. arXiv preprint arXiv:2505.01974, 2025

  36. [36]

    P. Zhi, P. Li, J. Yin, B. Jia, and S. Huang. Learning a unified policy for position and force control in legged loco-manipulation. arXiv preprint arXiv:2505.20829, 2025

  37. [37]

    O’Neill, A

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  38. [38]

    Barreiros, A

    J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. Science Robotics, 11(113):eaea6201, 2026

  39. [39]

    Pertsch, K

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  40. [40]

    S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization. arXiv preprint arXiv:2602.03310, 2026

  41. [41]

    G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  42. [42]

    G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Bal- akrishna, N. Batchelor, A. Bewley, J. Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342, 2025

  43. [43]

    Khazatsky, K

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  44. [44]

    S. Wu, X. Liu, S. Xie, P. Wang, X. Li, B. Yang, Z. Li, K. Zhu, H. Wu, Y . Liu, et al. Robocoin: An open-sourced bimanual robotic data collection for integrated manipulation. arXiv preprint arXiv:2511.17441, 2025

  45. [45]

    Zheng, D

    R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Casta˜neda, F. Hu, Y . L. Tan, L. Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026

  46. [46]

    R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025. 12

  47. [47]

    C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao. Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies. arXiv preprint arXiv:2509.17759, 2025

  48. [48]

    Cai, R.-Z

    X. Cai, R.-Z. Qiu, G. Chen, L. Wei, I. Liu, T. Huang, X. Cheng, and X. Wang. In-n-on: Scaling egocentric manipulation with in-the-wild and on-task data. arXiv preprint arXiv:2511.15704, 2025

  49. [49]

    C. Yuan, C. Wen, T. Zhang, and Y . Gao. General flow as foundation affordance for scalable robot learning. arXiv preprint arXiv:2401.11439, 2024

  50. [50]

    Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life hu- man activity videos. arXiv preprint arXiv:2510.21571, 2025

  51. [51]

    Y . Fu, N. Chen, J. Zhao, S. Shan, G. Yao, P. Wang, Z. Wang, and S. Zhang. Metis: Multi-source egocentric training for integrated dexterous vision-language-action model. arXiv preprint arXiv:2511.17366, 2025

  52. [52]

    J. Lyu, K. Liu, X. Zhang, H. Liao, Y . Feng, W. Zhu, T. Shen, J. Chen, J. Zhang, Y . Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion. arXiv preprint arXiv:2602.12215, 2026

  53. [53]

    Driess, J

    D. Driess, J. Springenberg, B. Ichter, L. Yu, A. Li-Bell, K. Pertsch, A. Ren, H. Walke, Q. Vuong, L. X. Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. Advances in Neural Information Processing Systems, 38:102867– 102888, 2026

  54. [54]

    Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

  55. [55]

    S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  56. [56]

    T. Yuan, Z. Dong, Y . Liu, and H. Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

  57. [57]

    H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y . Feng, C. Xiang, Y . Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025

  58. [58]

    Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  59. [59]

    F. Lin, R. Nai, Y . Hu, J. You, J. Zhao, and Y . Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2505.11917, 2025

  60. [60]

    F. Lin, K. Arora, J. Mercat, H. Nishimura, P. Shah, C. Xu, M. Zhang, M. Zolotas, M. Angeles, O. Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation. arXiv preprint arXiv:2602.01067, 2026

  61. [61]

    Cheng, Y

    Z. Cheng, Y . Zhao, K. Wang, H. Zhang, and L. Song. Taco: A benchmark for lossless and lossy codecs of heterogeneous tactile data. In The Fourteenth International Conference on Learning Representations, 2026

  62. [62]

    Y . Wi, J. Yin, E. Xiang, A. Sharma, J. Malik, M. Mukadam, N. Fazeli, and T. Hellebrekers. Tac- talign: Human-to-robot policy transfer via tactile alignment. arXiv preprint arXiv:2602.13579, 2026. 13

  63. [63]

    F. Jia, X. Niu, S. Yang, Q. Ben, T. Huang, J. Wang, J. Pang, et al. Feel robot feels: Tactile feedback array glove for dexterous manipulation. arXiv preprint arXiv:2603.28542, 2026

  64. [64]

    Huang, Y

    B. Huang, Y . Wang, X. Yang, Y . Luo, and Y . Li. 3d-vitac: Learning fine-grained manipulation with visuo-tactile sensing. arXiv preprint arXiv:2410.24091, 2024

  65. [65]

    H. Xue, J. Ren, W. Chen, G. Zhang, Y . Fang, G. Gu, H. Xu, and C. Lu. Reactive diffusion policy: Slow-fast visual-tactile policy learning for contact-rich manipulation. arXiv preprint arXiv:2503.02881, 2025

  66. [66]

    K. Yu, Y . Han, Q. Wang, V . Saxena, D. Xu, and Y . Zhao. Mimictouch: Leveraging multi-modal human tactile demonstrations for contact-rich manipulation. arXiv preprint arXiv:2310.16917, 2023

  67. [67]

    Z. He, H. Fang, J. Chen, H.-S. Fang, and C. Lu. Foar: Force-aware reactive policy for contact- rich robotic manipulation. IEEE Robotics and Automation Letters, 2025

  68. [68]

    Y . Tian, S. Cheng, T. Wei, T. Zhou, Y . Zhang, Z. Liu, Q. Han, Z. Yuan, and H. Xu. Vi- tas: Visual tactile soft fusion contrastive learning for visuomotor learning. arXiv preprint arXiv:2602.11643, 2026

  69. [69]

    H. Choi, Y . Hou, C. Pan, S. Hong, A. Patel, X. Xu, M. R. Cutkosky, and S. Song. In-the-wild compliant manipulation with umi-ft. arXiv preprint arXiv:2601.09988, 2026

  70. [70]

    F. Liu, C. Li, Y . Qin, J. Xu, P. Abbeel, and R. Chen. Vitamin: Learning contact-rich tasks through robot-free visuo-tactile manipulation interface. arXiv preprint arXiv:2504.06156, 2025

  71. [71]

    C. Li, C. Liu, D. Wang, S. Zhang, L. Li, Z. Zeng, F. Liu, J. Xu, and R. Chen. Vitamin- b: A reliable and efficient visuo-tactile bimanual manipulation interface. arXiv preprint arXiv:2511.05858, 2025

  72. [72]

    Y . Xu, L. Wei, P. An, Q. Zhang, and Y .-L. Li. exumi: Extensible robot teaching system with action-aware task-agnostic tactile representation. arXiv preprint arXiv:2509.14688, 2025

  73. [73]

    Higuera, A

    C. Higuera, A. Sharma, T. Fan, C. K. Bodduluri, B. Boots, M. Kaess, M. Lambeta, T. Wu, Z. Liu, F. R. Hogan, et al. Tactile beyond pixels: Multisensory touch representations for robot manipulation. In Conference on Robot Learning, pages 105–123. PMLR, 2025

  74. [74]

    J. Bi, K. Y . Ma, C. Hao, M. Z. Shou, and H. Soh. Vla-touch: Enhancing vision-language-action models with dual-level tactile feedback. arXiv preprint arXiv:2507.17294, 2025

  75. [75]

    Zhang, H

    K. Zhang, H. Zhang, Z. Xu, Z. Zhang, M. R. I. Prince, X. Li, X. Han, Y . Zhou, A. Ajoudani, and Y . She. Tacvla: Contact-aware tactile fusion for robust vision-language-action manipulation. arXiv preprint arXiv:2603.12665, 2026

  76. [76]

    J. Yu, H. Liu, Q. Yu, J. Ren, C. Hao, H. Ding, G. Huang, G. Huang, Y . Song, P. Cai, et al. Forcevla: Enhancing vla models with a force-aware moe for contact-rich manipulation. Advances in Neural Information Processing Systems, 38:93409–93439, 2026

  77. [77]

    W. Wu, F. Lu, Y . Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y . Wang, S. Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

  78. [78]

    G. A. Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/nov-04-2025-GEN-0

  79. [79]

    Q. Liu, Q. Ye, Z. Sun, Y . Cui, G. Li, and J. Chen. Masked visual-tactile pre-training for robot manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13859–13875. IEEE, 2024. 14

  80. [80]

    Cheng, J

    N. Cheng, J. Xu, C. Guan, J. Gao, W. Wang, Y . Li, F. Meng, J. Zhou, B. Fang, and W. Han. Touch100k: A large-scale touch-language-vision dataset for touch-centric multimodal repre- sentation. Information Fusion, 124:103305, 2025

Showing first 80 references.