arxiv: 2605.14462 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

Yubo Zhao , Yujin Chai , Yunao Dong , Chengfeng Zhao , Zijiao Zeng , Yuan Liu , Chi-Keung Tang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interactionmonocular reconstructionphysically plausiblephysics simulation4D HOIcontact consistencyembodied AI

0 comments

The pith

HA-HOI recovers 4D human-object interactions from monocular videos that remain stable under physics simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HA-HOI to reconstruct human-object interactions from single videos in a way that produces motions suitable for physics simulation. It does this by first recovering human motion as the anchor for the interaction and then reconstructing and aligning the object to follow that motion. The combined trajectory is fed into a physics simulator to ensure contacts are stable and manipulations are functional. This matters because many existing methods create visually appealing but physically invalid animations that cannot be used directly in embodied AI or simulation training. If successful, it turns everyday videos into usable demonstrations for humanoid robots interacting with objects.

Core claim

HA-HOI uses a human-first, object-follow approach where the human motion recovered from the video serves as the interaction anchor, the object is reconstructed relative to the human action, and the resulting kinematic trajectory is projected into a physics-based simulation to generate a stable physical rollout.

What carries the argument

The human-first, object-follow formulation that anchors object reconstruction to human motion and projects the result into physics simulation for teacher-guided rollout.

Load-bearing premise

The human motion extracted from the monocular video is sufficiently accurate to serve as a reliable anchor for object placement and physics simulation without requiring further corrections.

What would settle it

Running the physics simulation on the HA-HOI output trajectories and observing whether the object maintains stable contact with the human without penetration or flying away across multiple test videos.

Figures

Figures reproduced from arXiv: 2605.14462 by Chengfeng Zhao, Chi-Keung Tang, Yuan Liu, Yubo Zhao, Yujin Chai, Yunao Dong, Zijiao Zeng.

**Figure 1.** Figure 1: Reconstructing interaction, not just trajectories. From monocular RGB videos, HA-HOI reconstructs simulation-ready 4D human-object interaction. Abstract Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these tra… view at source ↗

**Figure 2.** Figure 2: Overview of HA-HOI. From a monocular HOI video, our pipeline recovers human motion and object geometry, aligns them into a visually consistent 4D HOI sequence, and uses VLM-proposed contact priors to refine body-object interaction. The resulting physically plausible HOI trajectory serves as a teacher motion for humanoid-object simulation, enabling stable physical rollout from in-the-wild video. rather than… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison with CARI4D on BEHAVE test sequences, BEHAVE training [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce $\textbf{HA-HOI}$, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a $\textit{human-first, object-follow}$ formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, $\textbf{HA-HOI}$ improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HA-HOI's human-first anchoring plus physics teacher step gives a clean way to push monocular HOI tracks toward simulation use, but the abstract leaves the actual gains unmeasured.

read the letter

HA-HOI recovers human motion from monocular video as the main anchor, reconstructs the object relative to it, then projects the trajectory into a physics simulator to act as a teacher for stable rollout. That ordering and the explicit physics step are the clearest new elements compared with earlier symmetric monocular HOI methods. The paper does a good job naming the practical problem: visual tracks often break contacts or produce motions that fail when dropped into humanoid-object simulation. Framing the output as simulation-ready data is useful for anyone building embodied datasets or training policies from video. The human-first choice also cuts down on some monocular depth ambiguity by letting the more observable human motion guide the object. The main soft spot is the lack of any numbers, ablations, or error analysis in the abstract. Claims of improved alignment, contact consistency, and temporal stability are stated without showing how large the gains are or how the physics step handles the centimeter-scale contact and depth errors that standard monocular estimators produce. If the initial human trajectory already violates penetration or contact constraints, it is not obvious that the simulator recovers functional manipulation without per-sequence fixes that are not described. This work is aimed at groups doing video-to-simulation pipelines for robotics or animation. A reader who needs concrete HOI data for downstream training would find the framing worth looking at, even if the experiments still need tightening. It should go to peer review so the implementation details and quantitative results can be checked directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HA-HOI, a framework for reconstructing physically plausible 4D human-object interaction (HOI) animations from monocular videos. It proposes a human-first, object-follow formulation where human motion is recovered as the interaction anchor, the object is reconstructed and aligned relative to the human, and the kinematic trajectory is projected into a physics-based simulation to achieve stable physical rollout. The paper claims that this approach improves human-object alignment, contact consistency, temporal stability, and simulation readiness compared to prior monocular HOI reconstruction methods on both benchmark and in-the-wild videos.

Significance. If the central claims hold, this work has the potential to bridge the gap between visual HOI reconstruction and physically grounded simulation, which is significant for applications in embodied AI, robotics, and scalable 3D content creation. The emphasis on physical plausibility through simulation projection addresses a key limitation in existing methods that produce visually plausible but physically inconsistent trajectories. The availability of a project page suggests potential for reproducibility.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The central claim of improvements in alignment, contact consistency, temporal stability, and simulation readiness is stated without any quantitative metrics, tables, or ablation results in the abstract. If the results section does not provide numerical comparisons (e.g., contact error, penetration depth, or rollout stability scores) against baselines, the evidence for superiority over prior methods cannot be verified and is load-bearing for the contribution.
[§3.3] §3.3 (Physics Projection): The formulation treats recovered monocular human motion as a reliable teacher trajectory for physics simulation, but provides no quantitative bound on contact or depth errors from the upstream pose estimator nor describes the mechanism for resolving kinematic-dynamic mismatches without per-sequence tuning. This assumption is load-bearing for the physical plausibility claim, as standard monocular estimators routinely produce centimeter-scale contact violations.

minor comments (2)

[§3] Ensure all equations in the human-first formulation are numbered and cross-referenced consistently in the text.
[Figure 1] Figure 1 (overview) would benefit from explicit arrows or labels distinguishing the kinematic recovery step from the physics rollout step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative grounding in the abstract and clearer analysis of error assumptions in the physics projection stage. We address each major comment below and will incorporate revisions to improve verifiability while preserving the core human-first, object-follow formulation.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of improvements in alignment, contact consistency, temporal stability, and simulation readiness is stated without any quantitative metrics, tables, or ablation results in the abstract. If the results section does not provide numerical comparisons (e.g., contact error, penetration depth, or rollout stability scores) against baselines, the evidence for superiority over prior methods cannot be verified and is load-bearing for the contribution.

Authors: We agree that the abstract should explicitly reference quantitative evidence to make the claims immediately verifiable. Section 4 already contains tables reporting contact error, penetration depth, rollout stability, and alignment metrics against baselines on both benchmark and in-the-wild sequences, along with ablations on the physics projection component. We will revise the abstract to include a concise summary of these key numerical improvements (e.g., average reductions in contact violation and penetration depth). This change will be made in the next version. revision: yes
Referee: [§3.3] §3.3 (Physics Projection): The formulation treats recovered monocular human motion as a reliable teacher trajectory for physics simulation, but provides no quantitative bound on contact or depth errors from the upstream pose estimator nor describes the mechanism for resolving kinematic-dynamic mismatches without per-sequence tuning. This assumption is load-bearing for the physical plausibility claim, as standard monocular estimators routinely produce centimeter-scale contact violations.

Authors: We acknowledge that the current manuscript does not supply explicit quantitative bounds on upstream monocular pose errors or a detailed error-propagation analysis. The physics projection resolves mismatches via a general contact-aware optimization that applies corrective forces and friction constraints within the simulator; this process is formulated without sequence-specific hyperparameter tuning. To address the concern directly, we will add a new subsection in §3.3 (and corresponding experiments in §4) that reports measured contact and depth errors from the pose estimator on the evaluation set and quantifies how the projection step reduces them. This will be included as a partial revision. revision: partial

Circularity Check

0 steps flagged

Minor self-citation without load-bearing circularity in derivation

full rationale

The paper introduces a human-first object-follow formulation and projects kinematic trajectories into physics simulation as a teacher signal. No equations or steps in the provided abstract reduce the final output directly to fitted parameters from the same data by construction. The approach builds on standard monocular pose estimators and physics engines with a novel ordering and projection step that adds independent content. Any self-citations are not load-bearing for the core claims of improved alignment and simulation readiness, as the central method does not rely on a self-referential uniqueness theorem or ansatz smuggled from prior author work. This yields a low circularity score consistent with honest non-finding for most papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard assumptions from monocular 3D human pose estimation and rigid-body physics simulation; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption Human motion recovered from monocular video provides a sufficiently accurate anchor for object alignment
Central premise of the human-first formulation stated in the abstract

pith-pipeline@v0.9.0 · 5619 in / 1188 out tokens · 25291 ms · 2026-05-15T02:40:43.037392+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

human-first, object-follow formulation... human motion recovered as the interaction anchor... projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

contact loss... signed-distance-field contact objective... PPO residual controller with tracking reward

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

[1]

Reconstructing 4d spatial intelligence: A survey.arXiv preprint arXiv:2507.21045,

Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowen Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, et al. Reconstructing 4d spatial intelligence: A survey.arXiv preprint arXiv:2507.21045,

work page arXiv
[2]

arXiv preprint arXiv:2212.06870 (2022)

Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare.arXiv preprint arXiv:2212.06870,

work page arXiv
[3]

Scorehoi: Physically plausible reconstruction of human- object interaction via score-guided diffusion.arXiv preprint arXiv:2509.07920,

Ao Li, Jinpeng Liu, Yixuan Zhu, and Yansong Tang. Scorehoi: Physically plausible reconstruction of human- object interaction via score-guided diffusion.arXiv preprint arXiv:2509.07920,

work page arXiv
[4]

Unish: Unifying scene and human reconstruction in a feed-forward pass.arXiv preprint arXiv:2601.01222,

Mengfei Li, Peng Li, Zheng Zhang, Jiahao Lu, Chengfeng Zhao, Wei Xue, Qifeng Liu, Sida Peng, Wenxiao Zhang, Wenhan Luo, et al. Unish: Unifying scene and human reconstruction in a feed-forward pass.arXiv preprint arXiv:2601.01222,

work page arXiv
[5]

Track4world: Feedforward world-centric dense 3d tracking of all pixels.arXiv preprint arXiv:2603.02573,

Jiahao Lu, Jiayi Xu, Wenbo Hu, Ruijie Zhu, Chengfeng Zhao, Sai-Kit Yeung, Ying Shan, and Yuan Liu. Track4world: Feedforward world-centric dense 3d tracking of all pixels.arXiv preprint arXiv:2603.02573,

work page arXiv
[6]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11,

work page 2024
[7]

Wham: Reconstructing world-grounded humans with accurate 3d motion

Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080,

work page 2070
[8]

Closd: Closing the loop between simulation and diffusion for multi-task character control.arXiv preprint arXiv:2410.03441,

Guy Tevet, Sigal Raab, Setareh Cohan, Daniele Reda, Zhengyi Luo, Xue Bin Peng, Amit H Bermano, and Michiel van de Panne. Closd: Closing the loop between simulation and diffusion for multi-task character control.arXiv preprint arXiv:2410.03441,

work page arXiv
[9]

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025a. Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models.arXiv preprint arXiv:2404.07191,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Sam 3d body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989,

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani. Sam 3d body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989,

work page arXiv
[12]

End-to-end spatial-temporal transformer for real-time 4d hoi reconstruction.arXiv preprint arXiv:2603.14435,

Haoyu Zhang, Wei Zhai, Yuhang Yang, Yang Cao, and Zheng-Jun Zha. End-to-end spatial-temporal transformer for real-time 4d hoi reconstruction.arXiv preprint arXiv:2603.14435,

work page arXiv
[13]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation.arXiv preprint arXiv:2501.12202,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

12 This appendix provides additional qualitative results, implementation details, and evaluation protocols that support the main paper. • Appendix A: Technical Appendices.Implementation details of the reconstruction and simulation pipeline, including: – Appendix A.1: Object Scale Estimator.Monocular object-scale estimation using silhouette IoU and depth c...

work page 2024
[15]

Penetration LossPenalize interpenetration using signed distance: Lpen = X v∈Vpen max(0,−d(v)) 2 (12) whered(v)<0indicates penetration depth

A.3.2 Loss Function Details Contact LossBidirectional chamfer distance between contact regions: Lcontact = 1 |Vh| X v∈Vh min u∈Vo ∥v−u∥ 2 2 + 1 |Vo| X u∈Vo min v∈Vh ∥u−v∥ 2 2 (11) whereV h andV o are vertices in human and object contact regions. Penetration LossPenalize interpenetration using signed distance: Lpen = X v∈Vpen max(0,−d(v)) 2 (12) whered(v)<...

work page 2048