pith. machine review for the scientific record.

arxiv: 2605.05756 · v1 · submitted 2026-05-07 · 💻 cs.RO · cs.CV

Recognition: unknown

MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:19 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Human-Object Interaction · 3D Motion Generation · Diffusion Models · Geometric Forgetting · Kinematic Adapter · Contact Precision · Motion Synthesis

The pith

MaMi-HOI overcomes geometric forgetting in diffusion models to generate human-object interactions that are both naturally moving and precisely contacting objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies geometric forgetting as the root cause of imprecise object contacts in prior 3D HOI generators, where semantic features overpower geometry details deeper in the diffusion process. It introduces a hierarchical MaMi-HOI framework with two adapters: GAPA to re-inject dense object geometry for residual snapping corrections and KHA to proactively align whole-body posture with spatial goals. A reader would care because the result is motion that stays both fluid and physically accurate, which prior methods could not sustain together.
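The diagnosis is in principle easy to instrument: probe, at each transformer depth, how well the hidden states align with the semantic versus geometric condition embeddings, which is what the paper's Figure 2 plots. Below is a minimal numpy sketch of such a probe; the toy data and all function names are our own, not the paper's code.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine similarity between matching rows of two (T, D) feature arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return float(np.mean(num / den))

def forgetting_curve(hidden_per_layer, semantic_cond, geometric_cond):
    """For each layer's hidden states, probe alignment with the semantic and
    geometric condition embeddings (the Figure 2 style diagnostic)."""
    return [(cosine_sim(h, semantic_cond), cosine_sim(h, geometric_cond))
            for h in hidden_per_layer]

# Toy illustration: hidden states that drift linearly from the geometric toward
# the semantic embedding with depth reproduce the "forgetting" signature.
rng = np.random.default_rng(0)
T, D, L = 8, 16, 6
sem = rng.normal(size=(T, D))
geo = rng.normal(size=(T, D))
layers = [geo + (l / (L - 1)) * (sem - geo) for l in range(L)]
curve = forgetting_curve(layers, sem, geo)
# Semantic alignment rises while geometric alignment falls across layers.
```

On real models the probe would run over captured activations rather than synthetic drift, but the shape of the curve is the quantity of interest either way.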

Core claim

MaMi-HOI is a hierarchical framework reconciling macro-level kinematic fluidity with micro-level spatial precision. It counters geometric forgetting by using the Geometry-Aware Proximity Adapter (GAPA) to explicitly re-inject dense object details and perform residual snapping corrections for precise contact, paired with the Kinematic Harmony Adapter (KHA) to align whole-body posture with spatial objectives so the skeleton accommodates constraints naturally.
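As a reading aid, the dual-adapter idea can be caricatured as two residual corrections: a trainable-copy residual for global pose (KHA) and a terminal snapping residual toward the object surface (GAPA). The sketch below is ours, not the paper's implementation; in particular, the nearest-point update stands in for GAPA's distance-aware cross-attention.

```python
import numpy as np

def kha_residual(motion, base_block, adapter_block):
    """KHA (sketch): a trainable copy of the base block whose output is added
    back as a residual to keep whole-body pose coherent (ControlNet-style)."""
    return base_block(motion) + adapter_block(motion)

def gapa_snap(motion, object_points, strength=0.5):
    """GAPA (sketch): a residual 'snapping' correction pulling each frame
    toward the nearest object point. This update rule is an illustrative
    assumption, not the paper's formulation."""
    corrected = motion.copy()
    for t in range(motion.shape[0]):
        d = np.linalg.norm(object_points - motion[t], axis=-1)
        nearest = object_points[np.argmin(d)]
        corrected[t] = motion[t] + strength * (nearest - motion[t])
    return corrected

# Toy demo: a two-frame hand trajectory, an identity base block with a zero
# adapter, then a half-strength snap toward a single object point.
motion = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
obj = np.array([[1.0, 0.0, 0.0]])
coherent = kha_residual(motion, base_block=lambda x: x, adapter_block=lambda x: 0.0 * x)
snapped = gapa_snap(coherent, obj, strength=0.5)
```

The ordering mirrors Figure 3: the global adapter acts inside the stack, while the geometric correction is applied terminally.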

What carries the argument

The MaMi-HOI framework's Geometry-Aware Proximity Adapter (GAPA) for re-injecting object geometry to enable residual snapping corrections, combined with the Kinematic Harmony Adapter (KHA) for aligning whole-body posture with spatial objectives.

If this is right

  • Precise object contacts are achieved simultaneously with natural whole-body motion.
  • Generation extends reliably to long-term tasks involving complex trajectories.
  • Global navigation and high-fidelity manipulation are bridged within a single 3D scene generation process.
  • Quantitative and qualitative experiments confirm improvements in both contact accuracy and motion realism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adapter pattern could apply to other diffusion-based motion generators where local geometric details degrade in deeper layers.
  • Robotics simulation and planning systems might gain more reliable contact-rich behaviors by incorporating similar geometry re-injection steps.
  • Evaluating the method on a wider range of object shapes and interaction types would test whether the forgetting correction generalizes beyond the paper's scenes.
  • Real-time VR or game engines could integrate the adapters to produce interactive, physically plausible human-object sequences on the fly.

Load-bearing premise

That geometric forgetting is the primary cause of imprecise contacts in existing methods and that the GAPA and KHA adapters can be added without introducing new motion artifacts or requiring major changes to the base diffusion process.

What would settle it

A controlled ablation on a standard HOI benchmark dataset measuring contact error rates and kinematic smoothness metrics over long sequences, where removing either adapter causes the method to lose its reported gains in precision or naturalness relative to baselines.
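Two concrete metrics such an ablation could report are mean contact distance and a jerk-based smoothness proxy. A minimal numpy sketch follows; the metric definitions and names are chosen by us for illustration, not taken from the paper.

```python
import numpy as np

def contact_error(hand_traj, object_points):
    """Mean per-frame distance from a hand joint to its nearest object point
    (a stand-in for benchmark contact metrics)."""
    d = np.linalg.norm(hand_traj[:, None, :] - object_points[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

def mean_jerk(joint_traj, dt=1.0 / 30.0):
    """Kinematic smoothness proxy: mean magnitude of the third finite
    difference (jerk) of a joint trajectory sampled at 1/dt Hz."""
    jerk = np.diff(joint_traj, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=-1).mean())

# A smooth linear reach has (numerically) zero jerk; a jittery one does not.
t = np.linspace(0.0, 1.0, 30)[:, None]
smooth = np.hstack([t, np.zeros_like(t), np.zeros_like(t)])
rng = np.random.default_rng(1)
jittery = smooth + 0.01 * rng.normal(size=smooth.shape)
```

Reporting both numbers per ablation arm is what would separate "precise but stiff" from "fluid but floating" failure modes.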

Figures

Figures reproduced from arXiv: 2605.05756 by Hao Wang, Qi Liu, Shiqi Wang.

Figure 1
Figure 1. High-fidelity interaction synthesis generated by the proposed MaMi-HOI framework. Note the precise contact and natural human motion.
Figure 2
Figure 2. Representative example of geometric forgetting: feature cosine similarity between the hidden states of the diffusion transformer and the input conditions for a single interaction sequence. While semantic alignment (red line) increases in deeper layers, geometric alignment (blue line) degrades, illustrating that high-level semantic features can overshadow fine-grained spatial cues.
Figure 3
Figure 3. Overview of MaMi-HOI. The framework employs a dual-adapter paradigm to reconcile motion naturalness and contact precision. The Kinematic Harmony Adapter acts as a trainable copy to maintain global pose coherence, while the Geometry-Aware Proximity Adapter serves as a terminal module that explicitly injects local geometric constraints for accurate contact alignment.
Figure 4
Figure 4. Architecture of GAPA. The module recovers fine-grained spatial cues via distance-aware cross-attention: it computes relative geometric context from object BPS features and injects it as a residual correction to the motion stream, with a learnable scalar γ controlling sensitivity to physical proximity and suppressing irrelevant long-range interactions.
Figure 5
Figure 5. Qualitative comparisons of sample sequences generated by CHOIS and MaMi-HOI. CHOIS frequently suffers from severe physical violations, including hand-object penetration, floating artifacts, and inharmonious body movements; MaMi-HOI generates physically plausible interactions with precise surface contact and coherent whole-body dynamics.
Figure 6
Figure 6. Application in realistic 3D scenes: synthesized interactions within a scene from the Replica dataset, conditioned on text instructions. These examples demonstrate MaMi-HOI's capability to generate environment-conscious and physically plausible motions that adapt to diverse object geometries.
Figure 7
Figure 7. Diverse interactions with complex trajectories: MaMi-HOI generates sequences for various object categories, synthesizing coherent full-body motion and precise object manipulation even when following non-linear, curved spatial waypoints.
Figure 8
Figure 8. Generalization to challenging geometries: interactions with objects possessing complex topologies, such as tall floor lamps and coat racks. Despite the geometric complexity, MaMi-HOI accurately retrieves spatial cues to establish stable contacts and physically plausible dynamics for both lifting and dragging tasks.
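The distance-aware cross-attention described in Figure 4 can be illustrated by biasing the attention logits with a scalar γ times the query-to-object distance, so distant points are suppressed. This is a sketch of the general mechanism under our own assumptions; the paper's exact bias form may differ.

```python
import numpy as np

def distance_aware_attention(queries, keys, values, dists, gamma=1.0):
    """Cross-attention whose logits are penalized by gamma * distance, so
    distant object points contribute little. In GAPA, gamma is learnable;
    here it is a fixed argument for illustration."""
    d_k = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d_k) - gamma * dists
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Two object points with identical features: the nearer one dominates.
q = np.ones((1, 4))
k = np.ones((2, 4))
v = np.array([[1.0], [0.0]])
dists = np.array([[0.0, 5.0]])   # query is near point 0, far from point 1
out, w = distance_aware_attention(q, k, v, dists, gamma=1.0)
```

With identical key features, the weight split is decided entirely by the distance bias, which is the "suppress irrelevant long-range interactions" behavior the caption describes.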
Original abstract

Generating realistic 3D Human-Object Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, which requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing methods excel at semantic alignment, however, they struggle to maintain precise object contact. We reveal a key finding termed \textit{Geometric Forgetting}: as diffusion model depth increases, semantic feature tend to overshadow object geometry feature, causing the model to lose its perception to object geometry. To address this, we propose MaMi-HOI, a hierarchical framework reconciling \textbf{Ma}cro-level kinematic fluidity with \textbf{Mi}cro-level spatial precision. First, to counteract geometric forgetting, we introduce the Geometry-Aware Proximity Adapter (GAPA), which explicitly re-injects dense object details to perform residual snapping corrections for precise contact. Nevertheless, such aggressive local enforcement can disrupt global dynamics, leading to robotic stiffness. In response, we introduce the Kinematic Harmony Adapter (KHA), which proactively aligns whole-body posture with spatial objectives, ensuring the skeleton actively accommodates constraints without compromising naturalness. Extensive experiments validate that MaMi-HOI simultaneously achieves natural motion and precise contact. Crucially, it extends generation capabilities to long-term tasks with complex trajectories, effectively bridging the gap between global navigation and high-fidelity manipulation in 3D scenes. Code is available at https://github.com/DON738110198/MaMi-HOI.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that diffusion-based HOI generation suffers from 'Geometric Forgetting,' where semantic features overshadow object geometry features at increasing model depth, causing imprecise contacts. To fix this, MaMi-HOI introduces a hierarchical framework with the Geometry-Aware Proximity Adapter (GAPA) for residual snapping corrections and the Kinematic Harmony Adapter (KHA) for whole-body posture alignment, claiming to achieve both natural motion and precise contact while extending to long-horizon tasks with complex trajectories in 3D scenes.

Significance. If validated, the approach could meaningfully advance HOI synthesis by reconciling global kinematics with local geometric constraints, with direct relevance to embodied AI and virtual content creation. The introduction of targeted adapters to mitigate a diagnosed diffusion pathology is a concrete engineering contribution, though its impact hinges on whether the adapters generalize without new artifacts.

major comments (2)
  1. [Abstract] The central claim that 'Geometric Forgetting' is the key cause of imprecise contacts is presented as a revealed finding, yet no supporting quantitative evidence (layer-wise feature norms, attention weights, depth-vs-contact-error curves, or an ablation on semantic vs. geometry feature dominance) is referenced. This makes it impossible to confirm that GAPA addresses the dominant failure mode rather than a secondary symptom, directly undermining the justification for the proposed adapters.
  2. [Abstract / Experiments (implied)] The assertion that GAPA and KHA can be inserted without creating new motion artifacts in long-horizon rollouts is load-bearing for the 'bridging global navigation and high-fidelity manipulation' claim, but the abstract provides no metrics, baselines, or ablation studies on contact precision, naturalness scores, or trajectory success rates to substantiate it. Without such data, the hierarchical reconciliation remains unverified.
minor comments (2)
  1. [Abstract] Minor grammatical issues ('semantic feature tend' should be 'semantic features tend'; 'perception to object geometry' should be 'perception of object geometry') should be corrected for clarity.
  2. [Abstract] The code link is provided, which is positive for reproducibility; ensure the repository includes the full training and evaluation scripts referenced in the experiments section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating where we will make revisions to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'Geometric Forgetting' is the key cause of imprecise contacts is presented as a revealed finding, yet no supporting quantitative evidence (layer-wise feature norms, attention weights, depth-vs-contact-error curves, or an ablation on semantic vs. geometry feature dominance) is referenced. This makes it impossible to confirm that GAPA addresses the dominant failure mode rather than a secondary symptom, directly undermining the justification for the proposed adapters.

    Authors: The manuscript presents Geometric Forgetting as a key finding supported by analysis of feature behavior across model depths, including comparisons of semantic and geometric feature influence and their correlation with contact errors. To improve accessibility and directly address the concern about substantiation in the abstract, we will revise the abstract to include a concise reference to this supporting analysis and the relevant figures. revision: yes

  2. Referee: [Abstract / Experiments (implied)] The assertion that GAPA and KHA can be inserted without creating new motion artifacts in long-horizon rollouts is load-bearing for the 'bridging global navigation and high-fidelity manipulation' claim, but the abstract provides no metrics, baselines, or ablation studies on contact precision, naturalness scores, or trajectory success rates to substantiate it. Without such data, the hierarchical reconciliation remains unverified.

    Authors: The full manuscript includes quantitative evaluations of long-horizon generation in the experiments section, reporting metrics for contact precision, naturalness via perceptual studies, trajectory success rates, and ablations against baselines that confirm the adapters do not introduce new artifacts. We will revise the abstract to summarize key quantitative results from these studies to better support the claims. revision: yes

Circularity Check

0 steps flagged

No circularity; adapters introduced as direct engineering fix to empirically noted limitation without self-referential definitions or fitted predictions.

Full rationale

The paper's chain begins with an observed limitation (geometric forgetting in diffusion depth) and responds by proposing two new adapters (GAPA for residual snapping, KHA for posture alignment). No equations appear that define a target quantity in terms of itself, no 'predictions' reduce to parameters fitted on the same data, and no self-citations or uniqueness theorems are invoked as load-bearing justification. Claims of improved natural motion and precise contact rest on experimental validation rather than tautological construction, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based on the abstract alone, the central claim rests on standard diffusion-model assumptions plus the two newly introduced adapters; no explicit free parameters are described.

axioms (1)
  • domain assumption Semantic features tend to overshadow object geometry features as diffusion model depth increases
    Presented as the key finding that motivates the adapters.
invented entities (2)
  • Geometry-Aware Proximity Adapter (GAPA) no independent evidence
    purpose: Re-injects dense object details to perform residual snapping corrections for precise contact
    New module proposed to counteract geometric forgetting
  • Kinematic Harmony Adapter (KHA) no independent evidence
    purpose: Proactively aligns whole-body posture with spatial objectives to maintain naturalness
    New module proposed to prevent stiffness from local corrections

pith-pipeline@v0.9.0 · 5569 in / 1354 out tokens · 46716 ms · 2026-05-08T09:19:03.270634+00:00 · methodology

discussion (0)

