Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

Chrysa Pratikaki; Jiankang Deng; Pablo Ruiz-Ponce; Rolandos Alexandros Potamias; Stefanos Zafeiriou

arxiv: 2605.30444 · v1 · pith:JG24RX4Hnew · submitted 2026-05-28 · 💻 cs.CV


Pith reviewed 2026-06-29 08:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interaction · bimanual manipulation · dexterous motion · diffusion model · text-to-motion · two-object interaction · motion synthesis


The pith

A dual-stream diffusion model generates dexterous bimanual two-object interactions from text at real-time speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to move human-object interaction generation past single-object cases to the everyday scenario of two objects handled simultaneously with both hands. It builds a text-conditioned diffusion model that assigns each object its own processing stream and links the streams through bidirectional cross-attention. A fusion network then incorporates hand-relative positions and contact signals across the full sequence. Autoregressive sampling over successive prefix windows produces long motions without any separate optimization stage, delivering large speed gains over earlier approaches.
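The prefix-conditioned windowing described above can be illustrated with a toy loop. This is a minimal sketch, not the paper's sampler: `denoise_window` is a hypothetical stand-in for a diffusion model, frames are single floats rather than pose vectors, and `window_len` and `prefix_len` are assumed values.

```python
import random

def denoise_window(prefix, window_len, rng, steps=4):
    """Hypothetical stand-in for a diffusion sampler (NOT a real model):
    starts from Gaussian noise and pulls the window toward a smooth
    continuation of the last prefix frame."""
    last = prefix[-1] if prefix else 0.0
    target = [last + 0.1 * (i + 1) for i in range(window_len)]
    x = [rng.gauss(0.0, 1.0) for _ in range(window_len)]
    for _ in range(steps):
        # Each "denoising" step halves the distance to the target.
        x = [0.5 * xi + 0.5 * ti for xi, ti in zip(x, target)]
    return x

def generate(total_len, window_len=8, prefix_len=2, seed=0):
    """Autoregressive sampling: every new window is conditioned on the
    last `prefix_len` frames already generated."""
    rng = random.Random(seed)
    motion = []
    while len(motion) < total_len:
        prefix = motion[-prefix_len:]  # empty for the first window
        motion.extend(denoise_window(prefix, window_len, rng))
    return motion[:total_len]

sequence = generate(total_len=50)
```

Because each window only needs the preceding prefix, the loop can run for any `total_len`, which is the property the summary attributes to the method.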

Core claim

Dex2HOI is a unified diffusion model for single- and two-object HOI synthesis from text. It processes each object in a dedicated stream coordinated by bidirectional cross-attention, fuses the streams with a Motion Fusion Network that uses hand-relative object representations and contact-aware conditioning, and samples the diffusion process autoregressively over prefix-conditioned windows to produce arbitrarily long sequences at real-time speed while omitting test-time optimization.
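As an illustration of two streams coordinated by bidirectional cross-attention, the sketch below lets each stream query the other and adds the result back residually. It is a simplified assumption, not the paper's layer: it uses a single head, identity projections instead of learned weights, and the function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention: every query token mixes the
    context tokens (single head, identity projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in context]
        w = softmax(scores)
        out.append([sum(wj * k[i] for wj, k in zip(w, context))
                    for i in range(d)])
    return out

def bidirectional_exchange(stream_a, stream_b):
    """Both directions are computed from the pre-update features,
    then added residually (the residual form is an assumption)."""
    a_from_b = cross_attend(stream_a, stream_b)
    b_from_a = cross_attend(stream_b, stream_a)
    new_a = [[x + y for x, y in zip(t, u)]
             for t, u in zip(stream_a, a_from_b)]
    new_b = [[x + y for x, y in zip(t, u)]
             for t, u in zip(stream_b, b_from_a)]
    return new_a, new_b

# Two tokens in the first object's stream, one in the second's.
new_a, new_b = bidirectional_exchange([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.5, 0.5]])
```

Computing both directions from the same pre-update features keeps the exchange symmetric, so neither object stream is privileged.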

What carries the argument

Dual-Stream Diffusion architecture with bidirectional cross-attention, Motion Fusion Network, hand-relative representations, contact-aware conditioning, and autoregressive prefix-conditioned window sampling.

If this is right

Achieves state-of-the-art quantitative results on single- and two-object HOI benchmarks.
Generates arbitrarily long sequences at real-time speed.
Delivers up to 540 times inference speedup over prior state-of-the-art methods.
Supports both single-object and two-object cases inside one model.
Removes the need for redundant test-time optimization steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adding further streams could extend the approach to three or more objects.
Real-time performance may suit interactive uses such as virtual environments or robot planning.
Hand-relative representations could improve generalization to novel object geometries.
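One plausible reading of a hand-relative object representation is to express the object's position in the local frame of the hand instead of in world coordinates. The sketch below assumes exactly that; the paper's actual feature layout is not given in this summary, and `to_hand_frame` and `to_world_frame` are hypothetical helper names.

```python
def to_hand_frame(obj_pos, hand_pos, hand_rot):
    """Express a world-space object position in the hand's local frame.
    hand_rot is a 3x3 rotation (list of rows) mapping hand-local
    coordinates to world coordinates, so we apply its transpose."""
    d = [o - h for o, h in zip(obj_pos, hand_pos)]
    return [sum(hand_rot[i][j] * d[i] for i in range(3))
            for j in range(3)]

def to_world_frame(rel_pos, hand_pos, hand_rot):
    """Inverse mapping: rotate back into world axes, then translate."""
    return [sum(hand_rot[i][j] * rel_pos[j] for j in range(3)) + hand_pos[i]
            for i in range(3)]

# Hand rotated 90 degrees about the z axis, object 1 unit along world y.
rot_z90 = [[0.0, -1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0]]
hand = [1.0, 2.0, 3.0]
obj = [1.0, 3.0, 3.0]
rel = to_hand_frame(obj, hand, rot_z90)       # [1.0, 0.0, 0.0]
back = to_world_frame(rel, hand, rot_z90)     # recovers obj
```

Under this assumption the relative coordinates stay the same when hand and object move together, which is one intuition for why such features might transfer across object geometries.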

Load-bearing premise

The dual-stream diffusion with bidirectional cross-attention and contact-aware conditioning produces coherent coordinated bimanual two-object motions without post-hoc optimization.

What would settle it

Generations from two-object prompts that exhibit frequent hand-object interpenetration or visibly uncoordinated hand movements would undermine the premise.
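An interpenetration check of this kind can be made concrete with a simple depth measure. The sketch below is an assumption-laden illustration: it approximates the object as a sphere, takes hand surface points in metres, and reports the deepest penetration in millimetres. It is not the paper's Pen. (mm) metric, whose definition is not given here, and `max_penetration_mm` is a hypothetical name.

```python
import math

def max_penetration_mm(hand_points, sphere_center, sphere_radius):
    """Deepest penetration of any hand point into a spherical object.
    Inputs are in metres; the result is in millimetres and is 0.0
    when no point lies inside the sphere."""
    deepest = 0.0
    for p in hand_points:
        dist = math.dist(p, sphere_center)
        deepest = max(deepest, sphere_radius - dist)
    return deepest * 1000.0

# One point outside the 5 cm sphere, one point 5 mm inside it.
points = [(0.06, 0.0, 0.0), (0.045, 0.0, 0.0)]
depth = max_penetration_mm(points, (0.0, 0.0, 0.0), 0.05)
```

For real meshes the sphere would be replaced by a signed distance query against the object surface, evaluated per frame and per object.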

Figures

Figures reproduced from arXiv: 2605.30444 by Chrysa Pratikaki, Jiankang Deng, Pablo Ruiz-Ponce, Rolandos Alexandros Potamias, Stefanos Zafeiriou.

**Figure 1.** Dex2HOI is a diffusion-based model for Human-Object Interaction that generates dexterous bimanual manipulations from text, supporting simultaneous interaction with up to two objects.

**Figure 2.** Dex2HOI addresses previous HOI limitations: single-object, bodypart omission, multi-step optimization.

Recent progress in HOI generation has led to increasingly realistic motions on primarily single-object benchmarks, driven largely by diffusion-based models [30, 31, 34, 58]. Despite this, most existing HOI methods are presented with three limitations. First, nearly all state-of-the-art full-body methods …

**Figure 3.** Dex2HOI generates dexterous bimanual two-object HOI sequences from text in single-shot, combining (a) Hand-Relative Object Motion Representations, (b) Dual-Stream HOI Diffusion, (c) our Motion Fusion Network, and (d) Geometrically-Aware Losses.

3.1 Human and Object Motion Representations

We represent each motion sequence as a temporally aligned multimodal feature vector containing human rotations, human ro…

**Figure 4.** Dex2HOI qualitative results on the GRAB [39] dataset (single-object HOI synthesis).

1 object

| Method | FID ↓ | Diversity → | MM → | Pen. (mm) ↓ | R@3 ↑ | MMd ↓ | Runtime (s) ↓ |
|---|---|---|---|---|---|---|---|
| GT | — | 0.706 | 0.194 | — | 0.617 | — | — |
| MDM [43] | 0.514 | 0.674 | 0.186 | 5.85 | 0.570 | 1.137 | 0.56±0.03 |
| IMoS [14] (w/o obj optim) | 0.521 | 0.608 | 0.207 | 6.23 | 0.557 | 1.165 | 0.75±0.05 |
| IMoS [14] | 0.494 | 0.632 | 0.189 | 7.02 | 0.557 | 1.165 | 59.22±2.15 |
| HOIDiNi [34] (w/o DNO) | 0.503 | 0.715 | 0.2… |

**Figure 5.** Dex2HOI qualitative results: Contact comparison against HIMO [27] and MDM [43].

Method FID ↓ Diversity → MModality ↑ Pen. (mm) ↓ R-Pre@3 ↑ MMdist ↓ GT — 1.061 — — 0.520 1.210 MDM [43] 0.896 0.797 0.009 12.8 0.118 1.420 HIMO [27] 0.886 0.606 0.019 11.7 0.295 1.281 2 objects Ours 0.655 0.780 0.056 9.8 0.275 1.260 GT — 1.192 — — 0.665 1.250 MDM [43] 1.394 0.464 0.053 18.2 0.086 1.445 HIMO [27] 1.418 0.413 0.0…

**Figure 6.** Hand-Representation Ablation

Ablation Study. We design our ablation study around the core components of Dex2HOI and we provide the following ablation variants for two-obj evaluation: (i) w/o hand-relative representation, where we replace the proposed representation with global object trajectory prediction. (ii) w/o dual-stream architecture, where we use a single denoising stream that jointly predicts hu…

**Figure 7.** Empirical Evaluation: Left: VLM-eval for 1-object GRAB dataset. Middle: VLM-eval for 2-object HUMOTO dataset. Right: Collective User Preference. Dex2HOI achieves superior scores in realism and contact quality and is preferred by majority of participants over competing approaches.

5 Discussion and Conclusion

We presented Dex2HOI, a unified framework for human-object interaction generation that extends beyon…

Original abstract

Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text. At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence. By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed omitting redundant test-time optimization, achieving up to x540 inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation. Code and models will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dex2HOI offers a dual-stream diffusion model for text-driven bimanual two-object HOI generation with autoregressive long-sequence sampling.


The key takeaway is that Dex2HOI introduces a dual-stream diffusion architecture specifically designed for generating dexterous bimanual interactions with two objects from text, coordinated via bidirectional cross-attention and a motion fusion network, plus autoregressive sampling for extended sequences.

This is new in moving past single-object HOI to multi-object bimanual cases, which better reflects natural human actions. The approach integrates hand-relative object representations and contact-aware conditioning throughout the sequence, which helps with physical consistency. The autoregressive prefix-conditioned window sampling stands out for enabling arbitrarily long motions at real-time inference speeds, reportedly up to 540 times faster than previous methods that rely on optimization.

The paper handles both single- and two-object scenarios in one model, which is practical. It evaluates on appropriate benchmarks and positions the work as advancing toward more expressive multi-object manipulation.

Where it could be stronger is in verifying its central assumptions. The claim that the dual-stream design with cross-attention produces coherent, coordinated motions without additional optimization needs solid quantitative support from ablations on the fusion network and contact conditioning. If those hold, the method is sound; otherwise, coordination might break down in complex scenarios. The abstract shows no internal contradictions, and the speedup follows logically from skipping test-time optimization.

This work is for the motion synthesis community, particularly those extending diffusion models to more complex human behaviors. A reader looking for practical ways to generate longer, multi-entity interactions would get concrete ideas from the architecture and sampling strategy.

The paper demonstrates honest engagement with the literature by identifying the gap in two-object cases and building a targeted solution. It deserves a serious referee because the problem is important and the proposed components are clearly motivated, even if revisions might be needed to strengthen the empirical validation.

Recommendation: Send it to peer review rather than desk reject.

Referee Report

0 major / 2 minor

Summary. The paper introduces Dex2HOI, a unified text-conditioned diffusion model for generating dexterous bimanual single- and two-object human-object interactions. Its core components are a dual-stream diffusion architecture with bidirectional cross-attention between object streams, hand-relative object representations, contact-aware conditioning, a Motion Fusion Network, and autoregressive sampling over prefix-conditioned windows that enables arbitrarily long sequences in real time without test-time optimization, yielding up to 540x inference speedup over prior methods while reporting SOTA quantitative results on single- and two-object benchmarks.

Significance. If the architectural claims hold, the work meaningfully extends 4D HOI generation beyond single-object cases to coordinated bimanual multi-object manipulation and removes a major practical bottleneck (test-time optimization) for long-horizon synthesis. The combination of dual-stream coordination and prefix-window autoregression is a concrete step toward scalable, real-time motion generation with potential downstream value in robotics and animation.

minor comments (2)

[Abstract] The abstract states that the model achieves 'state-of-the-art quantitative results' on both single- and two-object benchmarks, but does not name the specific metrics, datasets, or competing methods used for the comparison.
[Abstract] The description of the Motion Fusion Network and how contact-aware conditioning is applied 'across the whole sequence' would benefit from an explicit diagram or pseudocode showing the data flow between the dual streams and the fusion stage.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of Dex2HOI, the recognition of its contributions to bimanual multi-object HOI generation, and the recommendation for minor revision. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces a new dual-stream diffusion architecture for bimanual two-object HOI generation, relying on bidirectional cross-attention, hand-relative representations, contact-aware conditioning, and autoregressive prefix-window sampling. No equations, fitted parameters, or predictions are described that reduce by construction to the inputs. No self-citations are invoked as load-bearing justifications for uniqueness theorems, ansatzes, or derivations. The speedup claim follows directly from the architectural choice to omit test-time optimization, and the model is presented as self-contained without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5756 in / 971 out tokens · 31202 ms · 2026-06-29T08:03:45.797531+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 10 canonical work pages · 3 internal anchors

[1] Adobe Inc. Adobe mixamo. https://www.mixamo.com/
[2] Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, and Otmar Hilliges. Physically plausible full-body hand-object interaction synthesis. In International Conference on 3D Vision (3DV), 2024.
[3] Junuk Cha, Jihyeon Kim, Jae Shin Yoon, and Seungryul Baek. Text2hoi: Text-guided 3d motion generation for hand-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[4] Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S. Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, and Dieter Fox. DexYCB: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[5] Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[6] Yuanpei Chen, Chen Wang, Yaodong Yang, and Karen Liu. Object-centric dexterous manipulation from human motion data. In Conference on Robot Learning (CoRL).
[7] Zerui Chen, Rolandos Alexandros Potamias, Shizhe Chen, Jiankang Deng, Cordelia Schmid, and Stefanos Zafeiriou. Ho-flow: Generalizable hand-object interaction generation with latent flow matching. arXiv preprint arXiv:2604.10836, 2026.
[8] Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, and Bugra Tekin. Diffh2o: Diffusion-based synthesis of hand-object interactions from textual descriptions. In SIGGRAPH Asia Conference Papers, 2024.
[9] Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, and Yuexin Ma. Laserhuman: language-guided scene-aware human motion generation in free environment. arXiv preprint arXiv:2403.13307, 2024.
[10] Christian Diller et al. Contact-guided 3d human-object interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[11] Jinguo Zhu et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
[12] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[13] Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Rolandos Alexandros Potamias, Taku Komura, Shuo Yang, Zheng Liu, and Bo Zhao. Egograsp: World-space hand-object interaction estimation from egocentric videos. arXiv preprint arXiv:2601.01050, 2026.
[14] Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. Imos: Intent-driven full-body motion synthesis for human-object interactions. In Eurographics, 2023.
[15] Mingzhen Huang, Fu-Jen Chu, Bugra Tekin, Kevin J Liang, Haoyu Ma, Weiyao Wang, Xingyu Chen, Pierre Gleize, Hongfei Xue, Siwei Lyu, Kris Kitani, Matt Feiszli, and Hao Tang. Hoigpt: Learning long sequence hand-object interaction with language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[16] Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[17] Nan Jiang, Tengyu Liu, Zhexuan Cao, Jieming Cui, Zhiyuan zhang, Yixin Chen, He Wang, Yixin Zhu, and Siyuan Huang. Full-body articulated human-object interaction. arXiv preprint arXiv:2212.10621, 2023.
[18] Nan Jiang et al. Autonomous character-scene interaction synthesis from text instruction. In SIGGRAPH Asia Conference Papers, 2024.
[19] Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan, Thabo Beeler, Supasorn Suwajanakorn, and Siyu Tang. Optimizing diffusion noise can serve as universal motion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[20] Jeonghwan Kim, Jisoo Kim, Jeonghyeon Na, and Hanbyul Joo. ParaHome: Parameterizing everyday home activities towards 3d generative modeling of human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[21] Jiaman Li, Jiajun Wu, and C Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics, 2023.
[22] Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, and C Karen Liu. Controllable human-object interaction synthesis. In European Conference on Computer Vision (ECCV), 2024.
[23] Kailin Li, Puhao Li, Tengyu Liu, Yuyang Li, and Siyuan Huang. Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[24] Muchen Li, Sammy Christen, Chengde Wan, Yujun Cai, Renjie Liao, Leonid Sigal, and Shugao Ma. Latenthoi: On the generalizable hand object motion generation with latent hand diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[25] Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion generation under complex interactions. International Journal of Computer Vision, 2024.
[26] Jiaxin Lu, Chun-Hao Paul Huang, Uttaran Bhattacharya, Qixing Huang, and Yi Zhou. Humoto: A 4d dataset of mocap human object interactions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[27] Xintao Lv, Liang Xu, Yichao Yan, Xin Jin, Congsheng Xu, Shuwen Wu, Yifan Liu, Lincheng Li, Mengxiao Bi, Wenjun Zeng, and Xiaokang Yang. HIMO: A new benchmark for full-body human interacting with multiple objects. In European Conference on Computer Vision (ECCV), 2024.
[28] Liang Pan, Jingbo Wang, Buzhen Huang, Junyu Zhang, Haofan Wang, Xu Tang, and Yangang Wang. Synthesizing physically plausible human motions in 3d scenes. In International Conference on 3D Vision (3DV), 2024.
[29] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[30] Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, and Huaizu Jiang. HOI-Diff: Text-driven synthesis of 3d human-object interactions using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.
[31] Huaijin Pi, Zhi Cen, Zhiyang Dou, and Taku Komura. Coda: Coordinated diffusion noise optimization for whole-body manipulation of articulated objects. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
[32] Sergey Prokudin, Christoph Lassner, and Javier Romero. Efficient learning on point clouds with basis point sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[34] Roey Ron, Guy Tevet, Haim Sawdayee, and Amit H. Bermano. Hoidini: Human-object interaction through diffusion noise optimization. arXiv preprint arXiv:2506.15625, 2025.
[35] Pablo Ruiz-Ponce, German Barquero, Cristina Palmero, Sergio Escalera, and José García-Rodríguez. Mixermdm: Learnable composition of human motion diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[36] Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, and Rolandos Alexandros Potamias. Interact2ar: Full-body human-human interaction generation via autoregressive diffusion models. arXiv preprint arXiv:2512.19692, 2025.
[37] Yi Shi, Jingbo Wang, Xuekun Jiang, Bingkun Lin, Bo Dai, and Xue Bin Peng. Interactive character control with auto-regressive motion diffusion models. ACM Transactions on Graphics, 2024.
[38] Kewei Sui, Anindita Ghosh, Inwoo Hwang, Bing Zhou, Jian Wang, and Chuan Guo. A survey on human interaction motion generation. International Journal of Computer Vision, 2026.
[39] Omid Taheri, Nima Ghorbani, Michael J. Black, and Dimitrios Tzionas. Grab: A dataset of whole-body human grasping of objects. In European Conference on Computer Vision (ECCV), 2020.
[40] Omid Taheri, Vasileios Choutas, Michael J. Black, and Dimitrios Tzionas. Goal: Generating 4d whole-body motion for hand-object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[41] Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, and Michael J. Black. Grip: Generating interaction poses using spatial cues and latent consistency. In International Conference on 3D Vision (3DV), 2024.
[42] Tencent Hunyuan 3D Digital Human Team. Hy-motion 1.0: Scaling flow matching models for text-to-motion generation. arXiv preprint arXiv:2512.23464, 2025.
[43] Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-Or, and Amit Haim Bermano. Human motion diffusion model. In International Conference on Learning Representations (ICLR), 2023.
[44] Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit Haim Bermano, and Michiel van de Panne. CLoSD: Closing the loop between simulation and diffusion for multi-task character control. In International Conference on Learning Representations (ICLR), 2025.
[45] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
[46] Zhenzhi Wang, Jingbo Wang, Yixuan Li, Dahua Lin, and Bo Dai. Intercontrol: Zero-shot human interaction generation by controlling every joint. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[47] Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, and Liang-Yan Gui. Unleashing guidance without classifiers for human-object interaction animation. In International Conference on Learning Representations (ICLR), 2026.
[48] Zhen Wu, Jiaman Li, Pei Xu, and C. Karen Liu. Human-object interaction from human-level instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[49] Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, and Liang-Yan Gui. Interact: Advancing large-scale versatile 3d human-object interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[50]

Intermimic: Towards universal whole-body control for physics-based human-object interactions

Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, and Liangyan Gui. Intermimic: Towards universal whole-body control for physics-based human-object interactions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[51]

OakInk: A large- scale knowledge repository for understanding hand-object interaction

Lixin Yang, Kailin Li, Xinyu Zhan, Fei Wu, Anran Xu, Liu Liu, and Cewu Lu. OakInk: A large- scale knowledge repository for understanding hand-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022
[52]

G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis

Yufei Ye, Abhinav Gupta, Kris Kitani, and Shubham Tulsiani. G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[53]

Black, Xue Bin Peng, and Davis Rempe

Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, and Davis Rempe. Generating human interaction motions in scenes with text control. InEuropean Conference on Computer Vision (ECCV), 2024

2024
[54]

Chainhoi: Joint-based kinematic chain modeling for human-object interaction generation

Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, and Wei-Shi Zheng. Chainhoi: Joint-based kinematic chain modeling for human-object interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 12

2025
[55]

Manipnet: neural manipulation synthesis with a hand-object spatial representation.ACM Transactions on Graphics, 2021

He Zhang, Yuting Ye, Takaaki Shiratori, and Taku Komura. Manipnet: neural manipulation synthesis with a hand-object spatial representation.ACM Transactions on Graphics, 2021

2021
[56]

Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[57]

Zhang et al

X. Zhang et al. Behave: Dataset and method for tracking human object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

2022
[58]

Diffgrasp: Whole-body grasping synthesis guided by object motion using a diffusion model

Yonghao Zhang, Qiang He, Yanguang Wan, Yinda Zhang, Xiaoming Deng, Cuixia Ma, and Hongan Wang. Diffgrasp: Whole-body grasping synthesis guided by object motion using a diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

2025
[59] Kaifeng Zhao, Gen Li, and Siyu Tang. DartControl: A diffusion-based autoregressive motion model for real-time text-driven motion control. In International Conference on Learning Representations (ICLR), 2025.
[60] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

