pith. machine review for the scientific record.

arxiv: 2604.08664 · v1 · submitted 2026-04-09 · 💻 cs.RO

Recognition: unknown

Generative Simulation for Policy Learning in Physical Human-Robot Interaction


Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3

classification 💻 cs.RO
keywords generative simulation · physical human-robot interaction · sim-to-real transfer · imitation learning · large language models · assistive robotics · zero-shot transfer · point cloud policy

The pith

A text-to-simulation pipeline using language models generates training data for robot policies that transfer directly to real assistive tasks with over 80 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generative simulation framework that turns high-level natural language prompts into diverse physical human-robot interaction scenarios. Large language and vision-language models create soft-body human models, scene layouts, and robot trajectories, which supply large-scale synthetic data for training vision-based imitation learning policies on point clouds. These policies are tested on real-world scratching and bathing tasks, where they achieve zero-shot transfer with success rates above 80 percent and handle unscripted human movements. The work automates environment creation, data collection, and policy training to address data scarcity in physical human-robot interaction.

Core claim

A zero-shot text2sim2real framework automatically synthesizes pHRI scenarios from natural-language prompts: LLMs and VLMs procedurally generate soft-body human models, scene layouts, and robot motion trajectories, and vision-based imitation learning policies trained on the resulting segmented point clouds transfer to real assistive tasks with success rates exceeding 80 percent and resilience to variable human motion.

What carries the argument

The generative simulation pipeline that uses LLMs and VLMs to procedurally generate soft-body human models, scene layouts, and robot trajectories from text prompts, enabling autonomous collection of synthetic demonstration data for imitation learning on point clouds.
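To make that front end concrete, here is a minimal sketch of what a prompt-to-scenario stage could look like. Everything in it is an illustrative assumption rather than the paper's published interface: the `ScenarioSpec` fields, the `query_llm` stub, and the JSON schema are hypothetical stand-ins for whatever structured output the authors' LLM stage actually emits.

```python
# Hypothetical sketch of the text-to-simulation front end: an LLM is asked to
# emit a structured scenario spec, which is validated before instantiation.
# Schema fields and the query_llm stub are assumptions, not the paper's API.
import json
from dataclasses import dataclass


@dataclass
class ScenarioSpec:
    task: str          # e.g. "scratching" or "bathing"
    human: dict        # soft-body parameters (pose, body shape, stiffness)
    scene: dict        # layout: bed placement, robot base position, etc.
    trajectory: list   # robot end-effector waypoints as (x, y, z) triples


def query_llm(prompt: str) -> str:
    """Placeholder for the LLM call; returns a canned spec for illustration."""
    return json.dumps({
        "task": "scratching",
        "human": {"pose": "supine", "shape_betas": [0.2, -0.5], "stiffness": 0.8},
        "scene": {"bed": [0.0, 0.0, 0.4], "robot_base": [0.9, 0.0, 0.0]},
        "trajectory": [[0.5, 0.1, 0.7], [0.52, 0.12, 0.68], [0.55, 0.15, 0.66]],
    })


def parse_scenario(prompt: str) -> ScenarioSpec:
    raw = json.loads(query_llm(prompt))
    # Reject malformed generations before handing the spec to a simulator.
    for key in ("task", "human", "scene", "trajectory"):
        if key not in raw:
            raise ValueError(f"LLM spec missing required field: {key}")
    return ScenarioSpec(**raw)


spec = parse_scenario("Scratch the left forearm of a person lying in bed.")
print(spec.task, len(spec.trajectory), "waypoints")
```

The validation step matters in any such pipeline: LLM output that fails schema checks can be regenerated cheaply, whereas a malformed scene discovered mid-simulation wastes a rollout.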

If this is right

  • Policies trained only on the generated synthetic data deploy directly in physical environments without real-world fine-tuning or additional data.
  • Varying natural-language prompts scales the creation of training scenarios for new assistive tasks without manual scene design.
  • Point-cloud-based imitation learning on the synthetic data produces behaviors robust to unscripted human motion during contact-rich tasks.
  • The full pipeline from prompt to trained policy removes the need for large-scale real-world data collection in pHRI.
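As a concrete reference point for the third bullet, the sketch below trains a point-cloud behavior-cloning policy on synthetic demonstrations. The PointNet-style encoder (shared per-point MLP plus max pooling) and MSE regression head are stand-ins chosen for brevity; the abstract says only that the policies are "vision-based imitation learning policies operating on segmented point clouds", so the authors' actual architecture may differ.

```python
# Minimal behavior-cloning sketch on segmented point clouds. The encoder and
# regression head are illustrative stand-ins, not the authors' architecture.
import torch
import torch.nn as nn


class PointCloudPolicy(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Per-point features: (x, y, z) plus one segmentation-label channel.
        self.point_mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, action_dim),  # e.g. 6-DoF delta pose + gripper
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 4); max pooling over the point axis
        # makes the encoding permutation-invariant.
        feats = self.point_mlp(points).max(dim=1).values
        return self.head(feats)


policy = PointCloudPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Random tensors stand in for a batch of synthetic demonstrations:
# segmented clouds paired with expert actions from generated trajectories.
clouds = torch.randn(32, 1024, 4)
expert_actions = torch.randn(32, 7)

for step in range(3):
    loss = nn.functional.mse_loss(policy(clouds), expert_actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: BC loss {loss.item():.4f}")
```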

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prompt-driven generation approach could be tested on other contact-rich tasks such as dressing or feeding to check whether success rates remain high when force profiles differ.
  • Replacing the current soft-body models with higher-fidelity physics engines might reduce any remaining sim-to-real gap for tasks that depend on precise contact forces.
  • Combining the generated point clouds with additional sensor modalities could improve robustness when real environments contain visual clutter not present in the synthetic scenes.

Load-bearing premise

The procedurally generated soft-body human models, scene layouts, and robot motion trajectories produced by LLMs and VLMs from natural-language prompts sufficiently capture the physical dynamics, contact forces, and behavioral variability of real human-robot interactions.

What would settle it

Measure policy success rates in a real-user study where participants introduce body types, motion speeds, or contact patterns outside the range of the procedurally generated models; if rates fall below 80 percent, the transfer claim is falsified.
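One minimal way to run that check is to compare a one-sided 95% lower confidence bound on the observed success rate against the 80 percent threshold, rather than the raw point estimate. The trial counts below are hypothetical, since the reviewed version reports none.

```python
# Minimal check of the >80% transfer claim: a Wilson score lower bound on the
# observed success rate. Trial counts are hypothetical; the paper omits them.
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """One-sided 95% Wilson score lower bound for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    center = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - margin) / denom


# Example: 26 successes in 30 user-study trials (hypothetical numbers).
lb = wilson_lower_bound(26, 30)
print(f"observed rate: {26/30:.2f}, 95% lower bound: {lb:.2f}")
print("claim survives" if lb > 0.80 else "cannot rule out a rate below 80%")
```

With these hypothetical numbers the point estimate is about 87% but the lower bound is roughly 0.73, which illustrates why the referee's demand for trial counts matters: at small sample sizes an observed rate above 80 percent does not by itself rule out a true rate below it.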

Figures

Figures reproduced from arXiv: 2604.08664 by Julian Millan, Junxiang Wang, Nir Pechuk, Tiancheng Wu, Xinwen Xu, Zackory Erickson.

Figure 1: Overview of our proposed “text2sim2real” pipeline for learning physical human-robot interaction (pHRI) policies. From a high-level textual … [figures/full_fig_p001_1.png]
Figure 2: Detailed breakdown of our proposed generative simulation pipeline for physical human-robot interaction. An LLM first generates structured … [figures/full_fig_p003_2.png]
Figure 3: Real-world setup for both bathing and scratching tasks, visualizing … [figures/full_fig_p006_3.png]
Original abstract

Developing autonomous physical human-robot interaction (pHRI) systems is limited by the scarcity of large-scale training data to learn robust robot behaviors for real-world applications. In this paper, we introduce a zero-shot "text2sim2real" generative simulation framework that automatically synthesizes diverse pHRI scenarios from high-level natural-language prompts. Leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), our pipeline procedurally generates soft-body human models, scene layouts, and robot motion trajectories for assistive tasks. We utilize this framework to autonomously collect large-scale synthetic demonstration datasets and then train vision-based imitation learning policies operating on segmented point clouds. We evaluate our approach through a user study on two physically assistive tasks: scratching and bathing. Our learned policies successfully achieve zero-shot sim-to-real transfer, attaining success rates exceeding 80% and demonstrating resilience to unscripted human motion. Overall, we introduce the first generative simulation pipeline for pHRI applications, automating simulation environment synthesis, data collection, and policy learning. Additional information may be found on our project website: https://rchi-lab.github.io/gen_phri/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a 'text2sim2real' generative simulation pipeline for physical human-robot interaction (pHRI) that uses LLMs and VLMs to procedurally create soft-body human models, scene layouts, and robot trajectories from natural-language prompts. It collects large-scale synthetic demonstration data in simulation, trains vision-based imitation learning policies on segmented point clouds, and evaluates zero-shot sim-to-real transfer on two assistive tasks (scratching and bathing) via a user study reporting success rates exceeding 80% with resilience to unscripted human motion. The work claims to be the first automated generative simulation framework for pHRI that handles environment synthesis, data collection, and policy learning end-to-end.

Significance. If the zero-shot transfer results hold under rigorous validation, the framework would meaningfully address data scarcity in pHRI by automating diverse scenario generation, enabling scalable training of contact-rich policies without manual simulation engineering. The integration of LLMs/VLMs for procedural soft-body and trajectory synthesis is a novel engineering contribution that could generalize to other assistive robotics domains, provided the generated dynamics sufficiently approximate real contact forces and human variability.

major comments (2)
  1. [Abstract and Evaluation] The central claim of >80% success rates with resilience to unscripted motion in the user study on scratching and bathing lacks any reported trial counts, number of participants, statistical tests, baseline comparisons, or quantification of 'unscripted' motion (e.g., via metrics on human trajectory variance or failure modes). Without these, the performance numbers cannot be assessed for reliability or compared to prior pHRI work.
  2. [Generative Simulation Pipeline] The zero-shot sim-to-real claim rests on the assumption that LLM/VLM-generated soft-body humans, layouts, and trajectories produce contact forces, deformations, and behavioral variability close to reality, yet no quantitative validation is provided (e.g., matching of force profiles, friction, or compliance parameters against real human tissue/sensor data). This is load-bearing for the transfer result, as high real-world success could arise from policy robustness rather than simulation fidelity.
minor comments (1)
  1. [Abstract] The abstract mentions a project website but the manuscript should include a brief summary of any additional results or videos hosted there to aid reviewers.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback. We address each major comment point-by-point below, indicating where we will revise the manuscript.

point-by-point responses
  1. Referee: [Abstract and Evaluation] The central claim of >80% success rates with resilience to unscripted motion in the user study on scratching and bathing lacks any reported trial counts, number of participants, statistical tests, baseline comparisons, or quantification of 'unscripted' motion (e.g., via metrics on human trajectory variance or failure modes). Without these, the performance numbers cannot be assessed for reliability or compared to prior pHRI work.

    Authors: We agree that the current presentation of the user study results is insufficiently detailed for rigorous assessment. While the manuscript reports success rates exceeding 80% from the user study on the two tasks, it does not explicitly state trial counts, participant numbers, statistical tests, baseline comparisons, or quantitative metrics for unscripted motion. In the revised manuscript, we will expand the Evaluation section to include these specifics, such as the number of participants and trials performed, any statistical analysis, available baseline comparisons, and metrics on human trajectory variance and failure modes. This will improve transparency and allow direct comparison to prior pHRI work. revision: yes

  2. Referee: [Generative Simulation Pipeline] The zero-shot sim-to-real claim rests on the assumption that LLM/VLM-generated soft-body humans, layouts, and trajectories produce contact forces, deformations, and behavioral variability close to reality, yet no quantitative validation is provided (e.g., matching of force profiles, friction, or compliance parameters against real human tissue/sensor data). This is load-bearing for the transfer result, as high real-world success could arise from policy robustness rather than simulation fidelity.

    Authors: We acknowledge that the manuscript provides no direct quantitative validation (e.g., force profile matching or compliance parameters) of the generated soft-body dynamics against real human data. The zero-shot transfer results serve as indirect empirical support for the pipeline's utility in policy learning, but we agree this does not fully address the fidelity question. In revision, we will add a new subsection in the Generative Simulation Pipeline section that discusses the procedural generation parameters, any qualitative observations from simulation, and an explicit limitations paragraph on the lack of direct sensor-based validation. We will also suggest future work involving real-world force/torque data collection for more rigorous matching. revision: partial

standing simulated objections (unresolved)
  • Direct quantitative validation of generated contact forces, friction, and tissue compliance against real human sensor data, which would require new experimental hardware and data collection outside the scope of the presented generative framework.
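Were such sensor data collected, the comparison itself is straightforward; the sketch below resamples a simulated and a measured contact-force profile onto a common timeline and reports RMSE and peak-force error. The force traces here are synthetic stand-ins, since no real force/torque logs are available from the paper.

```python
# Hypothetical fidelity check for the unresolved objection: resample a
# simulated and a measured end-effector force profile onto a shared timeline
# and compare. The traces below are synthetic stand-ins for sensor logs.
import numpy as np


def force_profile_error(t_sim, f_sim, t_real, f_real):
    """RMSE and peak-force error between two normal-force time series (N)."""
    # Restrict to the overlapping time window before interpolating.
    t = np.linspace(max(t_sim[0], t_real[0]), min(t_sim[-1], t_real[-1]), 200)
    f_sim_i = np.interp(t, t_sim, f_sim)
    f_real_i = np.interp(t, t_real, f_real)
    rmse = float(np.sqrt(np.mean((f_sim_i - f_real_i) ** 2)))
    peak_err = float(abs(f_sim_i.max() - f_real_i.max()))
    return rmse, peak_err


t_sim = np.linspace(0, 5, 120)
t_real = np.linspace(0, 5, 500)
f_sim = 2.0 + 0.5 * np.sin(2 * np.pi * t_sim)          # simulated contact force
f_real = 2.3 + 0.6 * np.sin(2 * np.pi * t_real + 0.1)  # measured contact force

rmse, peak_err = force_profile_error(t_sim, f_sim, t_real, f_real)
print(f"RMSE: {rmse:.2f} N, peak-force error: {peak_err:.2f} N")
```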

Circularity Check

0 steps flagged

No circularity: empirical pipeline validated externally

full rationale

The paper describes an engineering pipeline that uses LLMs/VLMs to procedurally generate soft-body simulations from text prompts, collects synthetic demonstrations, trains point-cloud imitation policies, and evaluates zero-shot transfer via independent real-world user studies on scratching and bathing tasks. No equations, fitted parameters, or self-citations reduce any claimed result to an input by construction. Success metrics (>80% real-world rates, resilience to unscripted motion) are measured against external human participants rather than internal definitions or self-referential fits. The derivation chain is grounded in real-world benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the generative fidelity of off-the-shelf LLMs and VLMs plus the assumption that imitation learning on segmented point clouds will generalize across the sim-to-real gap; no new physical constants or entities are introduced.

axioms (1)
  • domain assumption: Large language models and vision-language models can generate accurate soft-body human models, scene layouts, and motion trajectories for pHRI tasks from natural-language prompts.
    Invoked as the foundation of the text2sim2real pipeline in the abstract.

pith-pipeline@v0.9.0 · 5509 in / 1506 out tokens · 75313 ms · 2026-05-10T17:01:41.646498+00:00 · methodology

