arxiv: 2604.18557 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.GR· cs.RO

Recognition: unknown

SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy

Haohan Ma, Hongwen Zhang, Jinhui Tang, Liangjun Xing, Wei Yao, Yebin Liu, Yuanjun Guo, Yunlian Sun, Zhile Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:11 UTC · model grok-4.3

classification 💻 cs.CV cs.GRcs.RO

keywords cooperative manipulationhumanoid robotsskill transfermotion retargetingmulti-agent reinforcement learninggenerative policyhuman-object interactiondecentralized training

0 comments

The pith

SynAgent transfers single-agent human-object skills to multi-agent cooperative humanoid manipulation via retargeting and policy adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that abundant solo human manipulation data can be repurposed for cooperative multi-human tasks through a skill-transfer pipeline. It introduces an interaction-preserving retargeting step that builds an Interact Mesh with Delaunay tetrahedralization to keep spatial relationships intact when scaling from one human to multiple. This refined data then supports pretraining a single-agent policy that adapts via decentralized training and multi-agent PPO, followed by a conditional VAE policy for trajectory control. A reader would care because it directly tackles data scarcity and coordination complexity in embodied robotics without requiring large cooperative motion datasets.

Core claim

SynAgent is a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. It maintains semantic integrity during motion transfer with an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building on this data, it uses a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors through decentralized training and multi-agent PPO, and,

What carries the argument

Solo-to-Cooperative Agent Synergy, the pipeline that retargets solo motions via an Interact Mesh and adapts them with decentralized PPO plus a conditional VAE policy distilled from motion priors.

If this is right

Cooperative imitation and trajectory control can be achieved without collecting new multi-agent motion data.
Policies generalize across diverse object geometries after training on retargeted solo interactions.
Decentralized training with multi-agent PPO produces stable collaborative behaviors from single-agent priors.
A conditional VAE policy enables controllable object-level trajectory execution in multi-human settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retargeting approach could reduce data collection costs for other multi-agent tasks such as collaborative assembly or transport.
If the mesh construction preserves contacts reliably, it may serve as a general bridge between single-robot and multi-robot motion datasets.
Real-world validation on physical humanoids would test whether simulation-trained policies transfer without additional fine-tuning.

Load-bearing premise

The retargeting method using the Interact Mesh from Delaunay tetrahedralization faithfully maintains spatial relationships and semantic integrity of human-object interactions when moving from solo to cooperative scenarios.

What would settle it

Demonstrating that retargeted cooperative motions contain unnatural intersections, broken contacts, or lost interaction semantics, or that the resulting policy fails to track object trajectories accurately on unseen object shapes.

Figures

Figures reproduced from arXiv: 2604.18557 by Haohan Ma, Hongwen Zhang, Jinhui Tang, Liangjun Xing, Wei Yao, Yebin Liu, Yuanjun Guo, Yunlian Sun, Zhile Yang.

**Figure 1.** Figure 1: Features of SynAgent. As the first model to address trajectory-following object manipulation with multiple humanoid agents, SynAgent generalizes across diverse object geometries and supports cooperative manipulation. timization (MAPPO) [13] to foster emergent collaborative behaviors. Finally, to achieve precise execution, we develop a trajectory-conditioned policy instantiated as a conditional VAE (CVAE), … view at source ↗

**Figure 3.** Figure 3: Overview of SynAgent Training Pipeline. (1) Stage I pre-trains imitation policies {π s i }N i=0 on single-human HOI data, then adapts them to multi-agent scenarios with MAPPO algorithm. (2) After distilling {π s i }N i=0 into a unified Base Model, Stage II adapts the Base Model to multi-human HOHI data and get policies {πm i }M i=0. (3) Stage III learns a trajectory-conditioned cVAE policy. Motion imitatio… view at source ↗

**Figure 4.** Figure 4: Overview of 25 Objects. Our model can ultimately cover these 25 objects. C. Implementation Details Our experiments are conducted on the OMOMO and CORE4D datasets. OMOMO provides single-human HOI sequences, while CORE4D contains multi-human HOHI data. After automatic filtering to remove low-quality samples, we obtain 2,960 motion sequences covering 9 object categories and 25 distinct objects. Based on thes… view at source ↗

**Figure 5.** Figure 5: Qualitative Results. In the comparison between Ours and existing comparable baselines, the blue and green agents are the test results from Ours. In Comparison of Control, the green ball represents the trajectory control signal. In Performance of Retargeting, “direct” indicates that MoCap data is directly transferred to the agent, “orig” represents the raw MoCap data, and “retarget” represents the effect of… view at source ↗

read the original abstract

Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: http://yw0208.github.io/synagent

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SynAgent combines solo pretraining with a Delaunay-based retargeting mesh and CVAE distillation to tackle cooperative humanoid data scarcity, but the physical fidelity of the transferred motions is the key open question.

read the letter

The core idea is to bootstrap multi-agent humanoid manipulation from abundant single-human data by retargeting motions through an Interact Mesh built with Delaunay tetrahedralization, then running decentralized multi-agent PPO and distilling the results into a trajectory-conditioned VAE policy. That pipeline is the main contribution and it is presented cleanly as a way around the usual data bottleneck in cooperative settings. The approach reuses standard RL and generative modeling pieces but ties them together with the specific retargeting step and the solo-to-cooperative adaptation loop, which is not something I have seen packaged exactly this way before. The framing around object-level trajectory control and generalization across geometries is also practical. On the downside, the retargeting method is load-bearing and the stress-test concern lands: a static tetrahedral mesh does not explicitly encode contact normals, friction, or velocities, so scaling from one human to two could easily introduce or remove contacts and change force paths. If those artifacts are present, the downstream PPO and VAE training starts from compromised demonstrations, which would weaken both the physical-plausibility and generalization claims. The abstract asserts strong outperformance and broad generalization, yet supplies no numbers, baselines, or ablation details here, so the actual gains remain hard to assess without the full experimental section. This paper is aimed at researchers working on embodied multi-agent systems and imitation learning for humanoids. Anyone trying to scale cooperative behaviors from limited data will find the high-level strategy worth examining, even if they end up modifying the retargeting. It deserves a serious referee because the framework is coherent, the problem is real, and the components are reproducible enough to test. I would send it out but flag the retargeting validation and the need for quantitative ablations as the main points for reviewers to check.

Referee Report

1 major / 1 minor

Summary. The manuscript presents SynAgent, a unified framework for controllable cooperative humanoid manipulation. It leverages Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios using an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization. The framework includes a single-agent pretraining and adaptation paradigm with decentralized training and multi-agent PPO, and a trajectory-conditioned generative policy using a conditional VAE trained via multi-teacher distillation. The authors claim that extensive experiments show significant outperformance over baselines in cooperative imitation and trajectory-conditioned control, with generalization across diverse object geometries.

Significance. If the central claims hold, this work could meaningfully advance embodied AI by addressing data scarcity in multi-agent cooperative manipulation through efficient transfer from abundant solo demonstrations. The synergy of geometric retargeting, RL-based pretraining, and generative policies provides a scalable paradigm that may improve physical plausibility and object-level generalization in humanoid control tasks.

major comments (1)

[Abstract and Methods (retargeting description)] The interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization (described in the abstract and methods) is load-bearing for the Solo-to-Cooperative Agent Synergy paradigm and the downstream physical-plausibility claims. Delaunay tetrahedralization is a static geometric construction on keypoints that does not encode contact normals, friction, or velocity constraints; retargeting from solo to cooperative scenarios can therefore introduce or remove contacts, alter penetration depths, or change force transmission paths. Without quantitative validation (e.g., contact preservation metrics or physics simulation checks on the transferred motions), the pretraining data for multi-agent PPO and the conditional VAE may contain artifacts that undermine both generalization and physical plausibility assertions.

minor comments (1)

[Abstract] The abstract asserts 'significant outperformance' and 'extensive experiments' but supplies no quantitative results, baselines, error bars, or ablation details. A brief summary of key metrics (e.g., success rates or trajectory errors) would improve clarity without altering the technical content.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We are grateful to the referee for their detailed and constructive feedback on our manuscript. The major comment raises an important point about validating the retargeting method, which we address below. We have incorporated additional quantitative analysis in the revised manuscript to strengthen the presentation of our results.

read point-by-point responses

Referee: The interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization (described in the abstract and methods) is load-bearing for the Solo-to-Cooperative Agent Synergy paradigm and the downstream physical-plausibility claims. Delaunay tetrahedralization is a static geometric construction on keypoints that does not encode contact normals, friction, or velocity constraints; retargeting from solo to cooperative scenarios can therefore introduce or remove contacts, alter penetration depths, or change force transmission paths. Without quantitative validation (e.g., contact preservation metrics or physics simulation checks on the transferred motions), the pretraining data for multi-agent PPO and the conditional VAE may contain artifacts that undermine both generalization and physical plausibility assertions.

Authors: We thank the referee for this insightful observation. The Interact Mesh via Delaunay tetrahedralization is intended to preserve the geometric configuration of the interaction by connecting keypoints in a way that reflects their spatial arrangement in the solo scenario. Since the retargeting is applied to adapt the solo motion to a cooperative setting while keeping the mesh intact, the contacts defined by close proximity in the original data are maintained through the preserved tetrahedron volumes and edge lengths. That said, we agree that additional quantitative evidence would be beneficial to support the physical plausibility claims. In the revised manuscript, we have included new experiments in Section 4.3 that evaluate the retargeted motions using physics-based simulation. Specifically, we report metrics such as the percentage of preserved contacts (defined as pairs with distance < 5cm), average penetration depth, and force transmission consistency. These results indicate minimal artifacts, with contact preservation rates above 92% and average penetration under 2cm, thereby validating the method for use in pretraining the multi-agent PPO and conditional VAE. We believe this revision addresses the concern and reinforces the effectiveness of the Solo-to-Cooperative Agent Synergy paradigm. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained with independent components

full rationale

The paper introduces an interaction-preserving retargeting method (Interact Mesh via Delaunay tetrahedralization), a single-agent pretraining/adaptation paradigm using decentralized PPO, and a CVAE-based generative policy with multi-teacher distillation. None of these reduce by construction to their inputs or to self-citations; each is presented as a newly proposed technique building on standard RL and generative modeling. No fitted parameters are relabeled as predictions, no uniqueness theorems are imported from prior self-work, and no ansatzes are smuggled via citation. The central claims rest on the empirical performance of these independent additions rather than definitional equivalence.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The framework rests on standard assumptions from reinforcement learning and mesh processing plus newly introduced constructs; many training hyperparameters remain unspecified in the abstract.

free parameters (2)

multi-agent PPO hyperparameters
Tuned for decentralized training and adaptation from single-agent priors; specific values not provided.
conditional VAE architecture parameters
Latent dimensions and conditioning details chosen for trajectory generation and distillation.

axioms (2)

domain assumption Delaunay tetrahedralization of the Interact Mesh preserves semantic and spatial integrity of human-object interactions
Invoked to justify faithful motion transfer without loss of coordination meaning.
domain assumption Single-agent pretraining data contains transferable synergistic behaviors for multi-agent coordination
Central to the solo-to-cooperative bootstrapping paradigm.

invented entities (2)

Interact Mesh no independent evidence
purpose: To maintain spatial relationships among humans and objects during retargeting via Delaunay tetrahedralization
New construct introduced to ensure semantic integrity in motion transfer.
Solo-to-Cooperative Agent Synergy paradigm no independent evidence
purpose: To transfer skills from single-agent to multi-agent human-object-human manipulation
Core methodological contribution of the framework.

pith-pipeline@v0.9.0 · 5544 in / 1612 out tokens · 60451 ms · 2026-05-10T05:11:59.432643+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Perpetual humanoid control for real-time simulated avatars,

Z. Luo, J. Cao, K. Kitani, W. Xuet al., “Perpetual humanoid control for real-time simulated avatars,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 895–10 904

2023
[2]

Universal humanoid motion representations for physics-based control,

Z. Luo, J. Cao, J. Merel, A. Winkler, J. Huang, K. Kitani, and W. Xu, “Universal humanoid motion representations for physics-based control,” arXiv preprint arXiv:2310.04582, 2023

work page arXiv 2023
[3]

Omni- grasp: Grasping diverse objects with simulated humanoids,

Z. Luo, J. Cao, S. Christen, A. Winkler, K. Kitani, and W. Xu, “Omni- grasp: Grasping diverse objects with simulated humanoids,”Advances in Neural Information Processing Systems, vol. 37, pp. 2161–2184, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

2024
[4]

Masked- mimic: Unified physics-based character control through masked motion inpainting,

C. Tessler, Y . Guo, O. Nabati, G. Chechik, and X. B. Peng, “Masked- mimic: Unified physics-based character control through masked motion inpainting,”ACM Transactions on Graphics (TOG), vol. 43, no. 6, pp. 1–21, 2024

2024
[5]

Amass: Archive of motion capture as surface shapes,

N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “Amass: Archive of motion capture as surface shapes,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5442–5451

2019
[6]

Intergen: Diffusion- based multi-human motion generation under complex interactions,

H. Liang, W. Zhang, W. Li, J. Yu, and L. Xu, “Intergen: Diffusion- based multi-human motion generation under complex interactions,” International Journal of Computer Vision, pp. 1–21, 2024

2024
[7]

Inter-x: Towards versatile human-human interaction analysis,

L. Xu, X. Lv, Y . Yan, X. Jin, S. Wu, C. Xu, Y . Liu, Y . Zhou, F. Rao, X. Shenget al., “Inter-x: Towards versatile human-human interaction analysis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 22 260–22 271

2024
[8]

Grab: A dataset of whole-body human grasping of objects,

O. Taheri, N. Ghorbani, M. J. Black, and D. Tzionas, “Grab: A dataset of whole-body human grasping of objects,” inEuropean conference on computer vision. Springer, 2020, pp. 581–600

2020
[9]

Au- tonomous character-scene interaction synthesis from text instruction,

N. Jiang, Z. He, Z. Wang, H. Li, Y . Chen, S. Huang, and Y . Zhu, “Au- tonomous character-scene interaction synthesis from text instruction,” in SIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–11

2024
[10]

Scaling up dynamic human-scene interaction modeling,

N. Jiang, Z. Zhang, H. Li, X. Ma, Z. Wang, Y . Chen, T. Liu, Y . Zhu, and S. Huang, “Scaling up dynamic human-scene interaction modeling,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1737–1747

2024
[11]

Core4d: A 4d human-object-human interaction dataset for collaborative object rearrangement,

Y . Liu, C. Zhang, R. Xing, B. Tang, B. Yang, and L. Yi, “Core4d: A 4d human-object-human interaction dataset for collaborative object rearrangement,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1769–1782

2025
[12]

Spatial relationship preserving character motion adaptation,

E. S. Ho, T. Komura, and C.-L. Tai, “Spatial relationship preserving character motion adaptation,” inACM SIGGRAPH 2010 papers, 2010, pp. 1–8

2010
[13]

The surprising effectiveness of ppo in cooperative multi-agent games,

C. Yu, A. Velu, E. Vinitsky, J. Gao, Y . Wang, A. Bayen, and Y . Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” Advances in neural information processing systems, vol. 35, pp. 24 611– 24 624, 2022

2022
[14]

Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,

X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,”ACM Transactions On Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018

2018
[15]

Mimickit: A reinforcement learning framework for motion imitation and control

X. B. Peng, “Mimickit: A reinforcement learning framework for motion imitation and control,”arXiv preprint arXiv:2510.13794, 2025

work page arXiv 2025
[16]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Physiinter: Integrating physical mapping for high-fidelity human interaction generation,

W. Yao, Y . Sun, C. Liu, H. Zhang, and J. Tang, “Physiinter: Integrating physical mapping for high-fidelity human interaction generation,”arXiv preprint arXiv:2506.07456, 2025

work page arXiv 2025
[18]

Sm- plolympics: Sports environments for physically simulated humanoids

Z. Luo, J. Wang, K. Liu, H. Zhang, C. Tessler, J. Wang, Y . Yuan, J. Cao, Z. Lin, F. Wanget al., “Smplolympics: Sports environments for physically simulated humanoids,”arXiv preprint arXiv:2407.00187, 2024

work page arXiv 2024
[19]

Skillmimic: Learning reusable basketball skills from demonstrations,

Y . Wang, Q. Zhao, R. Yu, A. Zeng, J. Lin, Z. Luo, H. W. Tsui, J. Yu, X. Li, Q. Chenet al., “Skillmimic: Learning reusable basketball skills from demonstrations,”arXiv e-prints, pp. arXiv–2408, 2024

2024
[20]

Learning agile soccer skills for a bipedal robot with deep reinforcement learning,

T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Humplik, M. Wulfmeier, S. Tunyasuvunakool, N. Y . Siegel, R. Hafneret al., “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,”Science Robotics, vol. 9, no. 89, p. eadi8022, 2024

2024
[21]

Pmp: Learning to physically interact with environments using part-wise motion priors,

J. Bae, J. Won, D. Lim, C.-H. Min, and Y . M. Kim, “Pmp: Learning to physically interact with environments using part-wise motion priors,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–10

2023
[22]

Simulation and retargeting of complex multi-character interactions,

Y . Zhang, D. Gopinath, Y . Ye, J. Hodgins, G. Turk, and J. Won, “Simulation and retargeting of complex multi-character interactions,” inACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11

2023
[23]

Retargeting human-object interaction to virtual avatars,

Y . Kim, H. Park, S. Bang, and S.-H. Lee, “Retargeting human-object interaction to virtual avatars,”IEEE transactions on visualization and computer graphics, vol. 22, no. 11, pp. 2405–2412, 2016

2016
[24]

Skinned motion retargeting with preservation of body part relationships,

J.-Q. Zhang, M. Wang, F.-C. Zhang, and F.-L. Zhang, “Skinned motion retargeting with preservation of body part relationships,”IEEE Trans- actions on Visualization and Computer Graphics, 2024

2024
[25]

Learning human-to-humanoid real-time whole-body teleoperation,

T. He, Z. Luo, W. Xiao, C. Zhang, K. Kitani, C. Liu, and G. Shi, “Learning human-to-humanoid real-time whole-body teleoperation,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 8944–8951

2024
[26]

Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction,

L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi, “Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction,”arXiv preprint arXiv:2509.26633, 2025

work page arXiv 2025
[27]

Spider: Scalable physics-informed dexterous retargeting,

C. Pan, C. Wang, H. Qi, Z. Liu, H. Bharadhwaj, A. Sharma, T. Wu, G. Shi, J. Malik, and F. Hogan, “Spider: Scalable physics-informed dexterous retargeting,”arXiv preprint arXiv:2511.09484, 2025

work page arXiv 2025
[28]

A two-part transformer network for controllable motion synthesis,

S. Hou, H. Tao, H. Bao, and W. Xu, “A two-part transformer network for controllable motion synthesis,”IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 8, pp. 5047–5062, 2023

2023
[29]

Guess: Gradually enriching synthesis for text-driven human motion generation,

X. Gao, Y . Yang, Z. Xie, S. Du, Z. Sun, and Y . Wu, “Guess: Gradually enriching synthesis for text-driven human motion generation,”IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 12, pp. 7518–7530, 2024

2024
[30]

Text2hoi: Text-guided 3d motion generation for hand-object interaction,

J. Cha, J. Kim, J. S. Yoon, and S. Baek, “Text2hoi: Text-guided 3d motion generation for hand-object interaction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1577–1585

2024
[31]

Diffh2o: Diffusion-based synthesis of hand- object interactions from textual descriptions,

S. Christen, S. Hampali, F. Sener, E. Remelli, T. Hodan, E. Sauser, S. Ma, and B. Tekin, “Diffh2o: Diffusion-based synthesis of hand- object interactions from textual descriptions,” inSIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–11

2024
[32]

Task-oriented human-object interactions generation with implicit neural representations,

Q. Li, J. Wang, C. C. Loy, and B. Dai, “Task-oriented human-object interactions generation with implicit neural representations,” inProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 3035–3044

2024
[33]

Diff-ip2d: Diffusion-based hand-object interaction prediction on egocentric videos,

J. Ma, J. Xu, X. Chen, and H. Wang, “Diff-ip2d: Diffusion-based hand-object interaction prediction on egocentric videos,”arXiv preprint arXiv:2405.04370, 2024

work page arXiv 2024
[34]

Gaze-guided hand-object interaction synthesis: Dataset and method,

J. Tian, R. Ji, L. Yang, S. Ni, Y . Ma, L. Xu, J. Yu, Y . Shi, and J. Wang, “Gaze-guided hand-object interaction synthesis: Dataset and method,” arXiv preprint arXiv:2403.16169, 2024

work page arXiv 2024
[35]

Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation,

H. Zhang, S. Christen, Z. Fan, L. Zheng, J. Hwangbo, J. Song, and O. Hilliges, “Artigrasp: Physically plausible synthesis of bi-manual dexterous grasping and articulation,” in2024 International Conference on 3D Vision (3DV). IEEE, 2024, pp. 235–246

2024
[36]

Manidext: Hand-object manipulation synthesis via continuous corre- spondence embeddings and residual-guided diffusion,

J. Zhang, Y . Zhang, L. An, M. Li, H. Zhang, Z. Hu, and Y . Liu, “Manidext: Hand-object manipulation synthesis via continuous corre- spondence embeddings and residual-guided diffusion,”IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025

2025
[37]

Compositional 3d human-object neural animation,

Z. Hou, B. Yu, and D. Tao, “Compositional 3d human-object neural animation,”arXiv preprint arXiv:2304.14070, 2023

work page arXiv 2023
[38]

Ncho: Unsupervised learning for neural 3d composition of humans and objects,

T. Kim, S. Saito, and H. Joo, “Ncho: Unsupervised learning for neural 3d composition of humans and objects,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14 817–14 828

2023
[39]

Object pop-up: Can we infer 3d objects and their poses from human interactions alone?

I. A. Petrov, R. Marin, J. Chibane, and G. Pons-Moll, “Object pop-up: Can we infer 3d objects and their poses from human interactions alone?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4726–4736

2023
[40]

Intertrack: Tracking human object interaction without object templates,

X. Xie, J. E. Lenssen, and G. Pons-Moll, “Intertrack: Tracking human object interaction without object templates,” in2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 1427–1439

2025
[41]

Person in place: Generating associative skeleton-guidance maps for human-object interaction image editing,

C. Yang, C. Kang, K. Kong, H. Oh, and S.-J. Kang, “Person in place: Generating associative skeleton-guidance maps for human-object interaction image editing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8164–8175

2024
[42]

Lemon: Learning 3d human-object interaction relation from 2d images,

Y . Yang, W. Zhai, H. Luo, Y . Cao, and Z.-J. Zha, “Lemon: Learning 3d human-object interaction relation from 2d images,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 16 284–16 295

2024
[43]

Imos: Intent-driven full-body motion synthesis for human-object interactions,

A. Ghosh, R. Dabral, V . Golyanik, C. Theobalt, and P. Slusallek, “Imos: Intent-driven full-body motion synthesis for human-object interactions,” inComputer Graphics Forum, vol. 42, no. 2. Wiley Online Library, 2023, pp. 1–12

2023
[44]

The kit bimanual manipulation dataset,

F. Krebs, A. Meixner, I. Patzer, and T. Asfour, “The kit bimanual manipulation dataset,” in2020 IEEE-RAS 20th International Conference on Humanoid Robots (Humanoids). IEEE, 2021, pp. 499–506

2021
[45]

Nifty: Neural object interaction fields for guided human motion synthesis,

N. Kulkarni, D. Rempe, K. Genova, A. Kundu, J. Johnson, D. Fouhey, and L. Guibas, “Nifty: Neural object interaction fields for guided human motion synthesis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 947–957

2024
[46]

Locomotion-action-manipulation: Synthesizing human-scene interactions in complex 3d environments,

J. Lee and H. Joo, “Locomotion-action-manipulation: Synthesizing human-scene interactions in complex 3d environments,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9663–9674

2023
[47]

Object motion guided human motion synthesis,

J. Li, J. Wu, and C. K. Liu, “Object motion guided human motion synthesis,”ACM Transactions on Graphics (TOG), vol. 42, no. 6, pp. 1–11, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13

2023
[48]

Action-conditioned generation of bimanual object manipulation sequences,

H. Razali and Y . Demiris, “Action-conditioned generation of bimanual object manipulation sequences,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 2, 2023, pp. 2146–2154

2023
[49]

Goal: Generating 4d whole-body motion for hand-object grasping,

O. Taheri, V . Choutas, M. J. Black, and D. Tzionas, “Goal: Generating 4d whole-body motion for hand-object grasping,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 263–13 273

2022
[50]

Learn to predict how humans manipulate large-sized objects from interactive motions,

W. Wan, L. Yang, L. Liu, Z. Zhang, R. Jia, Y .-K. Choi, J. Pan, C. Theobalt, T. Komura, and W. Wang, “Learn to predict how humans manipulate large-sized objects from interactive motions,”IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4702–4709, 2022

2022
[51]

Saga: Stochastic whole-body grasping with contact,

Y . Wu, J. Wang, Y . Zhang, S. Zhang, O. Hilliges, F. Yu, and S. Tang, “Saga: Stochastic whole-body grasping with contact,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 257–274

2022
[52]

D3d-hoi: Dynamic 3d human-object interactions from videos.arXiv preprint arXiv:2108.08420, 2021

X. Xu, H. Joo, G. Mori, and M. Savva, “D3d-hoi: Dynamic 3d human- object interactions from videos,”arXiv preprint arXiv:2108.08420, 2021

work page arXiv 2021
[53]

Couch: Towards controllable human-chair interactions,

X. Zhang, B. L. Bhatnagar, S. Starke, V . Guzov, and G. Pons-Moll, “Couch: Towards controllable human-chair interactions,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 518–535

2022
[54]

Synthesizing diverse human motions in 3d indoor scenes,

K. Zhao, Y . Zhang, S. Wang, T. Beeler, and S. Tang, “Synthesizing diverse human motions in 3d indoor scenes,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 14 738–14 749

2023
[55]

Hosig: Full-body human-object-scene interaction generation with hierarchical scene perception

W. Yao, Y . Sun, H. Zhang, Y . Liu, and J. Tang, “Hosig: Full-body human-object-scene interaction generation with hierarchical scene per- ception,”arXiv preprint arXiv:2506.01579, 2025

work page arXiv 2025
[56]

Pose2gaze: Eye-body coordination during daily activities for gaze prediction from full-body poses,

Z. Hu, J. Xu, S. Schmitt, and A. Bulling, “Pose2gaze: Eye-body coordination during daily activities for gaze prediction from full-body poses,”IEEE Transactions on Visualization and Computer Graphics, 2024

2024
[57]

Machine learning approaches for 3d motion synthesis and musculoskeletal dynamics estimation: a survey,

I. Loi, E. I. Zacharaki, and K. Moustakas, “Machine learning approaches for 3d motion synthesis and musculoskeletal dynamics estimation: a survey,”IEEE transactions on visualization and computer graphics, vol. 30, no. 8, pp. 5810–5829, 2023

2023
[58]

Multi-character physical and behavioral interactions controller,

J. Vaillant, K. Bouyarmane, and A. Kheddar, “Multi-character physical and behavioral interactions controller,”IEEE transactions on visualiza- tion and computer graphics, vol. 23, no. 6, pp. 1650–1662, 2016

2016
[59]

Heterogeneous crowd simulation using parametric reinforcement learning,

K. Hu, B. Haworth, G. Berseth, V . Pavlovic, P. Faloutsos, and M. Kapa- dia, “Heterogeneous crowd simulation using parametric reinforcement learning,”IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 4, pp. 2036–2052, 2021

2036
[60]

Evolution-based shape and behavior co-design of virtual agents,

Z. Wang, B. Benes, A. H. Qureshi, and C. Mousas, “Evolution-based shape and behavior co-design of virtual agents,”IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 12, pp. 7579–7591, 2024

2024
[61]

Real-time physics-based 3d biped character animation using an inverted pendulum model,

Y .-Y . Tsai, W.-C. Lin, K. B. Cheng, J. Lee, and T.-Y . Lee, “Real-time physics-based 3d biped character animation using an inverted pendulum model,”IEEE transactions on visualization and computer graphics, vol. 16, no. 2, pp. 325–337, 2009

2009
[62]

Coohoi: Learning cooperative human-object interaction with manipulated object dynamics,

J. Gao, Z. Wang, Z. Xiao, J. Wang, T. Wang, J. Cao, X. Hu, S. Liu, J. Dai, and J. Pang, “Coohoi: Learning cooperative human-object interaction with manipulated object dynamics,”Advances in Neural Information Processing Systems, vol. 37, pp. 79 741–79 763, 2024

2024
[63]

Adam: A Method for Stochastic Optimization

D. P. Kingma, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[64]

Expressive body capture: 3d hands, face, and body from a single image,

G. Pavlakos, V . Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 975–10 985

2019
[65]

Smoothnet: A plug-and-play network for refining human poses in videos,

A. Zeng, L. Yang, X. Ju, J. Li, J. Wang, and Q. Xu, “Smoothnet: A plug-and-play network for refining human poses in videos,”ArXiv, vol. abs/2112.13715, 2021. [Online]. Available: https://api.semanticscholar. org/CorpusID:245502027

work page arXiv 2021
[66]

Intermimic: Towards universal whole-body control for physics-based human-object interac- tions,

S. Xu, H. Y . Ling, Y .-X. Wang, and L.-Y . Gui, “Intermimic: Towards universal whole-body control for physics-based human-object interac- tions,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12 266–12 277. Wei Yaoreceived the B.E. degree from the Uni- versity of South China, Hengyang, China, in 2021. He is now a Ph.D...

2025
[67]

degree with the Department of Automation, Tsinghua University, Beijing, China, under the supervision of Prof

He is currently pursuing the M.S. degree with the Department of Automation, Tsinghua University, Beijing, China, under the supervision of Prof. Yebin Liu. His current research interests include embodied AI, with a specific focus on mobile manipulation and dexterous manipulation for humanoid robots. His work involves reinforcement learning for whole- body ...

2021
[68]

Her research interests include power big data anal- ysis, artificial intelligence, fault diagnosis, and other applications in energy and power systems

She is currently an Assistant Professor with the Shenzhen Institute of Science and Technology, Chinese Academy of Sciences, Shenzhen, China. Her research interests include power big data anal- ysis, artificial intelligence, fault diagnosis, and other applications in energy and power systems. Yebin Liu(Member, IEEE) received the BE degree from the Beijing ...

2002