
arxiv: 2606.04269 · v1 · pith:LRNC5F6Xnew · submitted 2026-06-02 · 💻 cs.RO · cs.AI · cs.CV

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Pith reviewed 2026-06-28 09:24 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords deformable object manipulation · in-context imitation learning · flow-matching · temporal contrastive pretraining · sim-to-real transfer · visual representations · folding · robot manipulation

The pith

Given a single human demonstration, a policy can infer and execute multiple manipulation modes for deformable objects without gradient updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deformable object manipulation involves high-dimensional states that change over long sequences with several valid ways to complete a task. The paper shows that pretraining visual representations in simulation allows a transformer policy to read one demonstration and directly produce actions for different spatial arrangements and action orders. This matters because most imitation methods need many demonstrations or online updates to handle variations, whereas this method adapts in context. The full pipeline runs in simulation yet succeeds on real hardware with no extra collection or tuning.

Core claim

Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning.

What carries the argument

A flow-matching transformer policy conditioned on the demonstration, supported by deformation-aware visual representations learned via temporal contrastive pretraining in simulation.
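
As a rough illustration of how a flow-matching policy can turn noise into an action at inference time, the sketch below integrates a velocity field with fixed-step Euler updates. It is not the paper's implementation: `sample_action`, `toy_velocity`, the `context` dictionary, and the step count are hypothetical names and values chosen for this example, and the learned transformer is replaced by a closed-form velocity that follows a straight line to a known target.

```python
import random


def sample_action(velocity_fn, context, action_dim, num_steps=10, rng=None):
    """Integrate da/dt = v(a, t, context) from t=0 (noise) to t=1 with Euler steps."""
    rng = rng or random.Random(0)
    # Start from Gaussian noise, as is usual for flow-matching samplers.
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t runs over 0, dt, ..., 1 - dt, so 1 - t stays positive
        velocity = velocity_fn(action, t, context)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


def toy_velocity(action, t, context):
    """Stand-in for the learned network (assumption, not the paper's model).

    For a straight-line path a_t = (1 - t) * a_0 + t * a_1, the velocity
    a_1 - a_0 can be rewritten as (a_1 - a_t) / (1 - t).
    """
    return [(goal - a) / (1.0 - t) for a, goal in zip(action, context["target"])]


if __name__ == "__main__":
    demo_context = {"target": [0.5, -0.2, 0.1]}  # hypothetical 3-D action
    print(sample_action(toy_velocity, demo_context, action_dim=3))
```

In an actual policy, `velocity_fn` would be the demonstration-conditioned network and `context` would carry the encoded demonstration and current observation. The number of Euler steps trades inference speed against integration accuracy.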

Load-bearing premise

The deformation-aware visual representations obtained from temporal contrastive pretraining in simulation are sufficient to support generalization across manipulation modes and zero-shot transfer to real-world settings without additional data or finetuning.
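
To make the premise more concrete, here is a minimal sketch of a temporal contrastive (InfoNCE-style) loss over a batch of two trajectories. The pairing rule is an assumption made for illustration: the positive for frame t is frame t+1 of the same trajectory, and every frame of the other trajectory is a negative. The summary above does not state the paper's actual pair construction, and `info_nce`, `temporal_contrastive_loss`, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: negative log-softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def temporal_contrastive_loss(traj_a, traj_b, temperature=0.1):
    """Average InfoNCE over two trajectories of per-frame embeddings.

    Assumed pairing (illustrative only): frame t's positive is frame t+1 of
    the same trajectory; negatives are all frames of the other trajectory.
    """
    losses = []
    for traj, other in ((traj_a, traj_b), (traj_b, traj_a)):
        for t in range(len(traj) - 1):
            losses.append(info_nce(traj[t], traj[t + 1], other, temperature))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    smooth_a = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
    smooth_b = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
    print(temporal_contrastive_loss(smooth_a, smooth_b))
```

The loss is small when consecutive frames of one trajectory embed close together and far from the other trajectory's frames, which is the kind of structure a deformation-aware encoder would need to pick up.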

What would settle it

A test in which the policy cannot produce a new spatial layout or ordering from the same demonstration, or requires finetuning to succeed on real hardware, would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.04269 by Cheng Qian, Edward Johns, Yilong Wang.

Figure 1. Given a single human demonstration as a prompt, I…
Figure 2. Overview of our method. INSTANT-FOLD consists of two stages: deformation-aware temporal contrastive pretraining, which learns correspondence-preserving visual tokens, and in-context policy learning, which conditions action generation on demonstration context. Intuitively, the pretrained tokenizer captures how cloth geometry and appearance evolve under deformation, the context encoder infers the…
Figure 3. Illustration of Temporal Contrastive Pretraining with a batch size of two trajectories.
Figure 4. Visualizations of our pretrained cloth encoder on real-world demonstrations. A shared…
Figure 5. Scaling with context diversity. We vary the number of downstream training contexts while keeping the pretrained encoder and policy recipe fixed. For the seen split, evaluation is restricted to the 4-context folds from modes 1–4…
Figure 6. Representative examples of randomized initial cloth configurations used for training, il…
Figure 7. Visualizations of DINOV3 and UniGarmentManip representations on real-world folding…
Figure 8. Detailed architecture. Top: the context encoder maps demonstration tokens into a demonstration context by combining cloth tokens and robot-state tokens in the context keyframes through per-frame spatial attention and joint spatio-temporal attention, followed by learned summary tokens. Bottom: the flow-matching action decoder fuses the current scene with the demonstration context, concatenates the resultin…
Figure 9. Per-context same-init C-SR@95 for the best policy. Bars are sorted by performance, with seen and held-out contexts shown in paired colors. The hardest held-out cases are the simultaneous/side-fold variants, especially LL and SL, followed by SR.
E Real-world Experiments. Modelling Robot Occlusion. For real-world experiments, we observed severe overhead-camera occlusions, particularly during the second stage…
Figure 10. Gripper occlusion modelling. Green contour: original cloth mask; white dots: gripper centers; yellow dots: newly occluded visible particles.
Figure 11. Real-world garments used in our experiments.
Figure 12. Visualization of all mode keyframes except final retraction frame.
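
The Figure 10 caption describes gripper occlusion modelling in terms of gripper centers and newly occluded visible particles. A minimal sketch of one way such masking could work is given below. The disc-shaped gripper footprint, the `radius` parameter, and the function name `apply_gripper_occlusion` are assumptions introduced for illustration; the caption does not specify how the paper computes occlusion.

```python
import math


def apply_gripper_occlusion(particles_xy, visible, gripper_centers_xy, radius):
    """Return (updated visibility, indices of newly occluded particles).

    A particle that was visible becomes occluded if it lies within `radius`
    (image-plane distance) of any gripper center. The circular footprint is
    an illustrative assumption, not the paper's stated model.
    """
    updated = list(visible)
    newly_occluded = []
    for i, (px, py) in enumerate(particles_xy):
        if not visible[i]:
            continue  # already hidden, so it cannot be "newly" occluded
        for gx, gy in gripper_centers_xy:
            if math.hypot(px - gx, py - gy) <= radius:
                updated[i] = False
                newly_occluded.append(i)
                break
    return updated, newly_occluded


if __name__ == "__main__":
    particles = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5)]
    visible = [True, False, True, True]
    print(apply_gripper_occlusion(particles, visible, [(0.0, 0.0)], radius=1.0))
```

Applying a mask like this to simulated observations is one plausible way to expose a policy to the overhead-camera occlusions mentioned for the real-world setup.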
Original abstract

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Instant-Fold, an in-context imitation learning framework for deformable object manipulation (DOM). It first pretrains deformation-aware visual representations via temporal contrastive learning in simulation, then deploys a flow-matching transformer policy conditioned on a single human demonstration to infer and execute diverse manipulation modes (including spatial and ordering variations) without gradient updates. The approach is trained entirely in simulation and claims zero-shot transfer to real-world settings without additional data or finetuning.

Significance. If the central claims hold, the work would advance sample-efficient, mode-generalizing policies for high-dimensional, topology-changing DOM tasks by demonstrating in-context inference from one demonstration and sim-to-real transfer via contrastive representations. This could reduce the data requirements typical in deformable manipulation and support broader applicability of flow-matching policies in robotics.

major comments (2)
  1. [Abstract] The central claim that temporal contrastive pretraining in simulation alone yields deformation-aware features sufficient for mode inference, diverse execution, and zero-shot real-world transfer is load-bearing, yet the provided text supplies no quantitative results, ablation studies on representation robustness, or tests addressing sim-real gaps in friction, elasticity, or sensor noise. This leaves the sufficiency of the representations unverified.
  2. [Abstract] The assertion of generalization across folding modes and zero-shot transfer without domain randomization, real data mixing, or explicit mode-inference ablations is not supported by any reported metrics or error analysis in the given text, undermining evaluation of the weakest assumption regarding representation invariance.
minor comments (1)
  1. [Abstract] The abstract references a project website for videos but does not indicate whether quantitative results or ablations appear in the main text or supplementary material.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract accordingly to better highlight the quantitative support present in the full manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that temporal contrastive pretraining in simulation alone yields deformation-aware features sufficient for mode inference, diverse execution, and zero-shot real-world transfer is load-bearing, yet the provided text supplies no quantitative results, ablation studies on representation robustness, or tests addressing sim-real gaps in friction, elasticity, or sensor noise. This leaves the sufficiency of the representations unverified.

    Authors: The full manuscript contains quantitative results on success rates for mode inference and execution, ablation studies evaluating the temporal contrastive pretraining's impact on representation robustness, and sim-to-real experiments that explicitly test and report performance under variations in friction, elasticity, and sensor noise. We will revise the abstract to include key metrics and reference these supporting experiments and ablations.
    Revision: yes

  2. Referee: [Abstract] Abstract: The assertion of generalization across folding modes and zero-shot transfer without domain randomization, real data mixing, or explicit mode-inference ablations is not supported by any reported metrics or error analysis in the given text, undermining evaluation of the weakest assumption regarding representation invariance.

    Authors: The manuscript reports specific metrics on generalization across folding modes (including spatial and ordering variations), zero-shot transfer without domain randomization or real data mixing, and includes explicit ablations on mode inference along with error analysis in the experiments section. We will update the abstract to incorporate these metrics, ablations, and error analysis to substantiate the claims.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a two-stage learning pipeline (temporal contrastive pretraining for deformation-aware features, followed by conditioning a flow-matching transformer on a single demonstration) whose performance claims rest on empirical generalization from simulation to real rather than any definitional equivalence, fitted-parameter renaming, or self-citation chain. No equations, uniqueness theorems, or ansatzes are shown that reduce outputs to inputs by construction; the in-context mode inference and zero-shot transfer are presented as learned behaviors, not tautological restatements of the training procedure. This is the normal non-circular outcome for an empirical robotics method whose central results are externally falsifiable via held-out demonstrations and real-robot trials.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that simulation dynamics are close enough to reality for zero-shot transfer after contrastive pretraining; no new physical entities are introduced.

free parameters (2)
  • contrastive loss weights and augmentation parameters
    Temporal contrastive pretraining requires choices of positive/negative pair construction and loss scaling that are tuned on simulation data.
  • flow-matching network hyperparameters
    Architecture depth, conditioning dimension, and sampling schedule are selected to fit the demonstration-conditioned action distribution.
axioms (1)
  • domain assumption: Simulation accurately models real-world deformable object dynamics for zero-shot transfer
    The paper states that the policy is trained entirely in simulation yet transfers without finetuning.
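
To show where the two free-parameter entries enter, the block below writes out standard textbook forms of an InfoNCE contrastive loss and a straight-line (rectified-flow-style) flow-matching objective. These are assumed formulations for illustration only; this page does not give the paper's exact objectives. The symbols are introduced here: z denotes frame embeddings, τ the temperature (a loss-scaling choice), sim a similarity function, v_θ the velocity network, c the demonstration context, a_0 a noise sample, and a_1 a demonstrated action.

```latex
% InfoNCE for an anchor z_i with positive z_i^+ and negatives z_j^-:
\mathcal{L}_{\mathrm{NCE}} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}
       {\exp\!\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)
        + \sum_{j} \exp\!\big(\mathrm{sim}(z_i, z_j^{-})/\tau\big)}

% Flow matching along a straight line from noise a_0 to action a_1:
a_t = (1 - t)\, a_0 + t\, a_1, \qquad
t \sim \mathcal{U}[0, 1], \quad a_0 \sim \mathcal{N}(0, I)

\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, a_0,\, a_1}
  \big\lVert v_\theta(a_t, t, c) - (a_1 - a_0) \big\rVert^2
```

Under these forms, the choice of positives and negatives and the value of τ shape the first loss, while the distribution over t and the number of integration steps used at sampling time are the schedule choices behind the second.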


