Generative Simulation for Policy Learning in Physical Human-Robot Interaction
Pith reviewed 2026-05-10 17:01 UTC · model grok-4.3
The pith
A text-to-simulation pipeline using language models generates training data for robot policies that transfer directly to real assistive tasks with over 80 percent success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A zero-shot "text2sim2real" framework automatically synthesizes pHRI scenarios from natural-language prompts, using LLMs and VLMs to procedurally generate soft-body human models, scene layouts, and robot motion trajectories. Vision-based imitation-learning policies trained on the resulting segmented point clouds transfer to real assistive tasks with success rates exceeding 80 percent and resilience to variable human motion.
What carries the argument
The generative simulation pipeline that uses LLMs and VLMs to procedurally generate soft-body human models, scene layouts, and robot trajectories from text prompts, enabling autonomous collection of synthetic demonstration data for imitation learning on point clouds.
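The paper does not publish its prompt or response schema, so as a purely illustrative sketch (every field name here is hypothetical), the front end of such a pipeline reduces to validating a structured scene specification emitted by the LLM before it reaches the physics engine:

```python
import json

# Hypothetical scene-spec schema; the paper's actual prompt/response format
# is not shown in this review, so this is an assumed structure.
EXAMPLE_LLM_RESPONSE = json.dumps({
    "task": "scratching",
    "human": {"body_shape": [0.2, -0.1], "pose": "supine", "softness": 1.4},
    "scene": {"bed_height_m": 0.55, "robot_base_offset_m": [0.4, 0.0]},
    "trajectory": {"waypoints": [[0.10, 0.20, 0.90], [0.15, 0.20, 0.88]]},
})

def parse_scene_spec(llm_response: str) -> dict:
    """Validate and normalize a language-model-generated pHRI scene spec."""
    spec = json.loads(llm_response)
    required = {"task", "human", "scene", "trajectory"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"scene spec missing keys: {sorted(missing)}")
    # Clamp softness to a plausible range before handing it to a
    # soft-body solver; LLM outputs are not guaranteed to be in range.
    spec["human"]["softness"] = min(max(spec["human"]["softness"], 0.0), 1.0)
    return spec

spec = parse_scene_spec(EXAMPLE_LLM_RESPONSE)
print(spec["task"], len(spec["trajectory"]["waypoints"]))
```

The point of the sketch is that "text2sim" hinges on schema validation and range clamping: a single out-of-range compliance value from the LLM can silently produce unphysical training data.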
If this is right
- Policies trained only on the generated synthetic data deploy directly in physical environments without real-world fine-tuning or additional data.
- Varying natural-language prompts scales the creation of training scenarios for new assistive tasks without manual scene design.
- Point-cloud-based imitation learning on the synthetic data produces behaviors robust to unscripted human motion during contact-rich tasks.
- The full pipeline from prompt to trained policy removes the need for large-scale real-world data collection in pHRI.
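On the point-cloud side, policies in the 3D Diffusion Policy family consume fixed-size clouds, and farthest-point sampling is one standard way to produce them from a segmented observation. A minimal pure-Python sketch of that preprocessing step (illustrative only, not the paper's implementation):

```python
def farthest_point_sample(points, k):
    """Downsample a point cloud to k points by greedy farthest-point sampling.

    Starts from the first point, then repeatedly adds the point farthest
    from everything already selected, so coverage stays roughly uniform.
    """
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [points[0]]
    # Track each point's squared distance to the nearest selected point.
    dists = [d2(p, selected[0]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=dists.__getitem__)
        selected.append(points[idx])
        for i, p in enumerate(points):
            dists[i] = min(dists[i], d2(p, points[idx]))
    return selected

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.1, 0.1, 0.0)]
print(farthest_point_sample(cloud, 3))
# → [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```

Note that the near-duplicate point (0.1, 0.1, 0.0) is skipped, which is exactly the property that makes fixed-size cloud inputs robust to variable sensor density.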
Where Pith is reading between the lines
- The same prompt-driven generation approach could be tested on other contact-rich tasks such as dressing or feeding to check whether success rates remain high when force profiles differ.
- Replacing the current soft-body models with higher-fidelity physics engines might reduce any remaining sim-to-real gap for tasks that depend on precise contact forces.
- Combining the generated point clouds with additional sensor modalities could improve robustness when real environments contain visual clutter not present in the synthetic scenes.
Load-bearing premise
The procedurally generated soft-body human models, scene layouts, and robot motion trajectories produced by LLMs and VLMs from natural-language prompts sufficiently capture the physical dynamics, contact forces, and behavioral variability of real human-robot interactions.
What would settle it
Measure policy success rates in a real-user study where participants introduce body types, motion speeds, or contact patterns outside the range of the procedurally generated models; if rates fall below 80 percent, the transfer claim is falsified.
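To make that falsification criterion concrete: whether an observed success rate statistically clears 80 percent depends on trial counts, which the paper does not report. A one-sided exact lower confidence bound shows how quickly the claim weakens at small n (the 17-of-20 numbers below are hypothetical):

```python
from math import comb

def binom_lower_bound(successes: int, trials: int, alpha: float = 0.05) -> float:
    """One-sided exact (Clopper-Pearson-style) lower confidence bound.

    Returns the smallest success probability p for which observing at
    least `successes` hits is not too surprising (tail prob >= alpha).
    Uses a coarse grid search, which is enough for illustration.
    """
    for p in (i / 1000 for i in range(1001)):
        # P(X >= successes) under Binomial(trials, p)
        tail = sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
                   for k in range(successes, trials + 1))
        if tail >= alpha:
            return p
    return 1.0

# Hypothetical numbers: 17 successes in 20 trials is an 85% point estimate,
# yet the 95% lower bound falls well below the 80% threshold.
lb = binom_lower_bound(17, 20)
print(round(lb, 3))
```

Under these assumed numbers, an 85 percent point estimate is statistically compatible with a true rate in the mid-60s, which is why the report's call for explicit trial counts is load-bearing.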
Original abstract
Developing autonomous physical human-robot interaction (pHRI) systems is limited by the scarcity of large-scale training data to learn robust robot behaviors for real-world applications. In this paper, we introduce a zero-shot "text2sim2real" generative simulation framework that automatically synthesizes diverse pHRI scenarios from high-level natural-language prompts. Leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), our pipeline procedurally generates soft-body human models, scene layouts, and robot motion trajectories for assistive tasks. We utilize this framework to autonomously collect large-scale synthetic demonstration datasets and then train vision-based imitation learning policies operating on segmented point clouds. We evaluate our approach through a user study on two physically assistive tasks: scratching and bathing. Our learned policies successfully achieve zero-shot sim-to-real transfer, attaining success rates exceeding 80% and demonstrating resilience to unscripted human motion. Overall, we introduce the first generative simulation pipeline for pHRI applications, automating simulation environment synthesis, data collection, and policy learning. Additional information may be found on our project website: https://rchi-lab.github.io/gen_phri/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a 'text2sim2real' generative simulation pipeline for physical human-robot interaction (pHRI) that uses LLMs and VLMs to procedurally create soft-body human models, scene layouts, and robot trajectories from natural-language prompts. It collects large-scale synthetic demonstration data in simulation, trains vision-based imitation learning policies on segmented point clouds, and evaluates zero-shot sim-to-real transfer on two assistive tasks (scratching and bathing) via a user study reporting success rates exceeding 80% with resilience to unscripted human motion. The work claims to be the first automated generative simulation framework for pHRI that handles environment synthesis, data collection, and policy learning end-to-end.
Significance. If the zero-shot transfer results hold under rigorous validation, the framework would meaningfully address data scarcity in pHRI by automating diverse scenario generation, enabling scalable training of contact-rich policies without manual simulation engineering. The integration of LLMs/VLMs for procedural soft-body and trajectory synthesis is a novel engineering contribution that could generalize to other assistive robotics domains, provided the generated dynamics sufficiently approximate real contact forces and human variability.
major comments (2)
- [Abstract and Evaluation] The central claim of >80% success rates with resilience to unscripted motion in the user study on scratching and bathing lacks any reported trial counts, participant numbers, statistical tests, baseline comparisons, or quantification of 'unscripted' motion (e.g., via metrics on human trajectory variance or failure modes). Without these, the performance numbers cannot be assessed for reliability or compared to prior pHRI work.
- [Generative Simulation Pipeline] The zero-shot sim-to-real claim rests on the assumption that LLM/VLM-generated soft-body humans, layouts, and trajectories produce contact forces, deformations, and behavioral variability close to reality, yet no quantitative validation is provided (e.g., matching of force profiles, friction, or compliance parameters against real human tissue/sensor data). This is load-bearing for the transfer result, as high real-world success could arise from policy robustness rather than simulation fidelity.
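One way the requested validation could be quantified, sketched with hypothetical force traces (the paper reports no such measurements): resample simulated and measured contact-force profiles to a common length and compare them by RMSE:

```python
def resample(profile, n):
    """Linearly resample a 1-D force profile to n samples (crude time alignment)."""
    if len(profile) == n:
        return list(profile)
    out = []
    for i in range(n):
        t = i * (len(profile) - 1) / (n - 1)
        lo = int(t)
        hi = min(lo + 1, len(profile) - 1)
        frac = t - lo
        out.append(profile[lo] * (1 - frac) + profile[hi] * frac)
    return out

def force_profile_rmse(sim_forces, real_forces, n=100):
    """RMSE (in newtons) between resampled simulated and measured force traces."""
    a, b = resample(sim_forces, n), resample(real_forces, n)
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / n) ** 0.5

# Hypothetical traces: a soft-body sim that under-predicts peak contact force.
sim = [0.0, 1.2, 2.8, 3.0, 1.5, 0.2]
real = [0.0, 1.5, 3.4, 3.8, 1.9, 0.3]
print(round(force_profile_rmse(sim, real), 3))
```

A metric of this shape, reported across participants, would separate "the simulator matches real contact dynamics" from "the policy is robust to a mismatched simulator", which is exactly the ambiguity the report flags.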
minor comments (1)
- [Abstract] The abstract mentions a project website but the manuscript should include a brief summary of any additional results or videos hosted there to aid reviewers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment point-by-point below, indicating where we will revise the manuscript.
Point-by-point responses
-
Referee: [Abstract and Evaluation] The central claim of >80% success rates with resilience to unscripted motion in the user study on scratching and bathing lacks any reported trial counts, participant numbers, statistical tests, baseline comparisons, or quantification of 'unscripted' motion (e.g., via metrics on human trajectory variance or failure modes). Without these, the performance numbers cannot be assessed for reliability or compared to prior pHRI work.
Authors: We agree that the current presentation of the user study results is insufficiently detailed for rigorous assessment. While the manuscript reports success rates exceeding 80% from the user study on the two tasks, it does not explicitly state trial counts, participant numbers, statistical tests, baseline comparisons, or quantitative metrics for unscripted motion. In the revised manuscript, we will expand the Evaluation section to include these specifics, such as the number of participants and trials performed, any statistical analysis, available baseline comparisons, and metrics on human trajectory variance and failure modes. This will improve transparency and allow direct comparison to prior pHRI work. revision: yes
-
Referee: [Generative Simulation Pipeline] The zero-shot sim-to-real claim rests on the assumption that LLM/VLM-generated soft-body humans, layouts, and trajectories produce contact forces, deformations, and behavioral variability close to reality, yet no quantitative validation is provided (e.g., matching of force profiles, friction, or compliance parameters against real human tissue/sensor data). This is load-bearing for the transfer result, as high real-world success could arise from policy robustness rather than simulation fidelity.
Authors: We acknowledge that the manuscript provides no direct quantitative validation (e.g., force profile matching or compliance parameters) of the generated soft-body dynamics against real human data. The zero-shot transfer results serve as indirect empirical support for the pipeline's utility in policy learning, but we agree this does not fully address the fidelity question. In revision, we will add a new subsection in the Generative Simulation Pipeline section that discusses the procedural generation parameters, any qualitative observations from simulation, and an explicit limitations paragraph on the lack of direct sensor-based validation. We will also suggest future work involving real-world force/torque data collection for more rigorous matching. revision: partial
- Not addressed in revision: direct quantitative validation of generated contact forces, friction, and tissue compliance against real human sensor data, which would require new experimental hardware and data collection outside the scope of the presented generative framework.
Circularity Check
No circularity: empirical pipeline validated externally
Full rationale
The paper describes an engineering pipeline that uses LLMs/VLMs to procedurally generate soft-body simulations from text prompts, collects synthetic demonstrations, trains point-cloud imitation policies, and evaluates zero-shot transfer via independent real-world user studies on scratching and bathing tasks. No equations, fitted parameters, or self-citations are present that reduce any claimed result to an input by construction. Success metrics (>80% real-world rates, resilience to unscripted motion) are measured against external human participants rather than internal definitions or self-referential fits. The derivation chain is self-contained against real-world benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models and vision-language models can generate accurate soft-body human models, scene layouts, and motion trajectories for pHRI tasks from natural-language prompts.
Reference graph
Works this paper leans on
- [1] Y. Wang, Z. Wang, M. Nakura, P. Bhowal, C.-L. Kuo, Y.-T. Chen, Z. Erickson, and D. Held, "Articubot: Learning universal articulated object manipulation policy via large scale simulation," arXiv preprint arXiv:2503.03045, 2025.
- [2] M. Dalal, M. Liu, W. Talbott, C. Chen, D. Pathak, J. Zhang, and R. Salakhutdinov, "Local policies enable zero-shot long-horizon manipulation," in 2025 IEEE Intl. Conf. on Robotics and Automation (ICRA). IEEE, 2025, pp. 13875–13882.
- [3] W. Liu, Y. Wan, J. Wang, Y. Kuang, X. Shi, H. Li, D. Zhao, Z. Zhang, and H. Wang, "Fetchbot: Learning generalizable object fetching in cluttered scenes via zero-shot sim2real," in 9th Annual Conf. on Robot Learning, 2025.
- [4] Y. Wang, X. Qiu, J. Liu, Z. Chen, J. Cai, Y. Wang, T.-H. Wang, Z. Xian, and C. Gan, "Architect: Generating vivid and interactive 3d scenes with hierarchical 2d inpainting," Advances in Neural Information Processing Systems, vol. 37, pp. 67575–67603, 2024.
- [5] Q. Zhou, H. Zhang, X. Lin, Z. Zhang, Y. Chen, W. Liu, Z. Zhang, S. Chen, L. Fang, Q. Lyu, X. Sun, J. Yang, Z. Wang, B. C. Dang, Z. Chen, D. Ladia, Q. V. Dang, J. Liu, and C. Gan, "Virtual community: An open world for humans, robots, and society," in The Fourteenth Intl. Conf. on Learning Representations, 2026.
- [6] J. Ren, Y. Zhuang, X. Ye, L. Mao, X. He, J. Shen, M. Dogra, Y. Liang, R. Zhang, T. Yue, et al., "Simworld: An open-ended realistic simulator for autonomous agents in physical and social worlds," arXiv preprint arXiv:2512.01078, 2025.
- [7] M. N. Qureshi, S. Garg, F. Yandun, D. Held, G. Kantor, and A. Silwal, "Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting," in 2025 IEEE Intl. Conf. on Robotics and Automation (ICRA). IEEE, 2025, pp. 6502–6509.
- [8] H. Zhao, C. Zeng, L. Zhuang, Y. Zhao, S. Xue, H. Wang, X. Zhao, Z. Li, K. Li, S. Huang, et al., "High-fidelity simulated data generation for real-world zero-shot robotic manipulation learning with gaussian splatting," arXiv preprint arXiv:2510.10637, 2025.
- [9] Y. Jia, G. Wang, Y. Dong, J. Wu, Y. Zeng, H. Lin, Z. Wang, H. Ge, W. Gu, K. Ding, et al., "Discoverse: Efficient robot simulation in complex high-fidelity environments," in 2025 IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 6272–6279.
- [10] Y. Qin, B. Huang, Z.-H. Yin, H. Su, and X. Wang, "Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation," in Conf. on Robot Learning. PMLR, 2023, pp. 594–605.
- [11] T. G. W. Lum, O. Y. Lee, K. Liu, and J. Bohg, "Crossing the human-robot embodiment gap with sim-to-real rl using one human demonstration," in 9th Conf. on Robot Learning, ser. Proceedings of Machine Learning Research, J. Lim, S. Song, and H.-W. Park, Eds., vol. 305. PMLR, 27–30 Sep 2025, pp. 4418–4441.
- [12] H. Zhou, R. Wang, Y. Tai, Y. Deng, G. Liu, and K. Jia, "You only teach once: Learn one-shot bimanual robotic manipulation from video demonstrations," in Proceedings of Robotics: Science and Systems XXI, 2025, pp. 149–170.
- [13] Z. Yuan, T. Wei, L. Gu, P. Hua, T. Liang, Y. Chen, and H. Xu, "Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation," arXiv preprint arXiv:2508.20085, 2025.
- [14] R. G. Goswami, P. Krishnamurthy, Y. LeCun, and F. Khorrami, "Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation," arXiv preprint arXiv:2505.20425, 2025.
- [15] L. Höllein, A. Cao, A. Owens, J. Johnson, and M. Nießner, "Text2room: Extracting textured 3d meshes from 2d text-to-image models," in IEEE/CVF Intl. Conf. on Computer Vision (ICCV), October 2023, pp. 7909–7920.
- [16] Z. Erickson, V. Gangaram, A. Kapusta, C. K. Liu, and C. C. Kemp, "Assistive gym: A physics simulation framework for assistive robotics," IEEE Intl. Conf. on Robotics and Automation (ICRA), 2020.
- [17] R. Ye, W. Xu, H. Fu, R. K. Jenamani, V. Nguyen, C. Lu, K. Dimitropoulou, and T. Bhattacharjee, "Rcare world: A human-centric simulation world for caregiving robots," in 2022 IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 33–40.
- [18] Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, "Robogen: Towards unleashing infinite data for automated robot learning via generative simulation," arXiv preprint arXiv:2311.01455, 2023.
- [19] L. Wang, Y. Ling, Z. Yuan, M. Shridhar, C. Bao, Y. Qin, B. Wang, H. Xu, and X. Wang, "Gensim: Generating robotic simulation tasks via large language models," arXiv preprint arXiv:2310.01361, 2023.
- [20] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, "Eureka: Human-level reward design via coding large language models," arXiv preprint arXiv:2310.12931, 2023.
- [21] J. Ma, W. Liang, H.-J. Wang, Y. Zhu, L. Fan, O. Bastani, and D. Jayaraman, "Dreureka: Language model guided sim-to-real transfer," in Proceedings of Robotics: Science and Systems XX. RSS, 2024.
- [22] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox, "Mimicgen: A data generation system for scalable robot learning using human demonstrations," arXiv preprint arXiv:2310.17596, 2023.
- [23] Z. Xue, S. Deng, Z. Chen, Y. Wang, Z. Yuan, and H. Xu, "Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning," arXiv preprint arXiv:2502.16932, 2025.
- [24] Y. Yuan, J. Song, U. Iqbal, A. Vahdat, and J. Kautz, "Physdiff: Physics-guided human motion diffusion model," in Proceedings of the IEEE/CVF Intl. Conf. on Computer Vision, 2023, pp. 16010–16021.
- [25] R. Madan, S. Valdez, D. Kim, S. Fang, L. Zhong, D. T. Virtue, and T. Bhattacharjee, "Rabbit: A robot-assisted bed bathing system with multimodal perception and integrated compliance," in 2024 ACM/IEEE Intl. Conf. on Human-Robot Interaction, 2024, pp. 472–481.
- [26] Y. Gu and Y. Demiris, "Vttb: A visuo-tactile learning approach for robot-assisted bed bathing," IEEE Robotics and Automation Letters, vol. 9, no. 6, pp. 5751–5758, 2024.
- [27] Y. Gu and Y. Demiris, "Learning bimanual manipulation policies for bathing bed-bound people," in 2024 IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 8936–8943.
- [28] X. Liang, Z. Liu, K. Lin, E. Gu, R. Ye, T. Nguyen, C. Hsu, Z. Wu, X. Yang, C. S. Y. Cheung, et al., "Openrobocare: A multimodal multi-task expert demonstration dataset for robot caregiving," in 2025 IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 2661–2668.
- [29] S. W. Abeyruwan, L. Graesser, D. B. D'Ambrosio, A. Singh, A. Shankar, A. Bewley, D. Jain, K. M. Choromanski, and P. R. Sanketi, "i-sim2real: Reinforcement learning of robotic policies in tight human-robot interaction loops," in Conf. on Robot Learning. PMLR, 2023, pp. 212–224.
- [30] H. Chen, Y. Xu, Y. Ren, Y. Ye, X. Li, N. Ding, Y. Wu, Y. Liu, P. Cong, Z. Wang, et al., "Symbridge: A human-in-the-loop cyber-physical interactive system for adaptive human-robot symbiosis," in SIGGRAPH Asia 2025 Conference, 2025, pp. 1–12.
- [31] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, "Expressive body capture: 3D hands, face, and body from a single image," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10975–10985.
- [32] G. Authors, "Genesis: A generative and universal physics engine for robotics and beyond," December 2024. [Online]. Available: https://github.com/Genesis-Embodied-AI/Genesis
- [33] D. Casas and M. Comino-Trinidad, "SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image," in British Machine Vision Conference (BMVC), 2023.
- [34] Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, "3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations," in 2nd Workshop on Dexterous Manipulation: Design, Perception and Control (RSS), 2024.
- [35] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, and Y. Choi, "Clipscore: A reference-free evaluation metric for image captioning," in 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528.
- [36] J. Li, D. Li, S. Savarese, and S. Hoi, "Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742.
- [37] Z. Lin, D. Pathak, B. Li, J. Li, X. Xia, G. Neubig, P. Zhang, and D. Ramanan, "Evaluating text-to-visual generation with image-to-text generation," arXiv preprint arXiv:2404.01291, 2024.
- [38] V. Vierow, M. Fukuoka, A. Ikoma, A. Dörfler, H. O. Handwerker, and C. Forster, "Cerebral representation of the relief of itch by scratching," Journal of Neurophysiology, vol. 102, no. 6, pp. 3216–3224, 2009.
- [39] N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang, et al., "Sam 3: Segment anything with concepts," arXiv preprint arXiv:2511.16719, 2025.