pith. machine review for the scientific record.

arxiv: 2604.05436 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

Human Interaction-Aware 3D Reconstruction from a Single Image

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:52 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: 3D human reconstruction · multi-person scenes · single image · diffusion models · physics-based priors · group context · orthographic view · inter-human contact

The pith

HUG3D reconstructs textured 3D models of multiple interacting people from one image by jointly modeling group context and physical contacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-image 3D human reconstruction works for isolated figures but produces overlaps, missing occluded surfaces, and implausible contacts when people interact. The paper addresses this by first mapping the photo into a canonical orthographic view to remove perspective distortions. It then runs a joint diffusion process that generates consistent multi-view normals and images while treating both individual bodies and the overall group as coupled variables. A follow-on optimization step adds explicit physics-based priors that penalize interpenetration and enforce realistic contact geometry. The final textured mesh is obtained by fusing the generated views. Experiments show the pipeline yields higher-fidelity and more physically plausible results than either single-person or prior multi-person baselines.

Core claim

The HUG3D framework first transforms the input image to canonical orthographic space, then applies Human Group-Instance Multi-View Diffusion (HUG-MVD) to produce complete multi-view normals and images by jointly reasoning over instance and group context, and finally uses Human Group-Instance Geometric Reconstruction (HUG-GR) to optimize geometry with explicit physics-based interaction priors that enforce physical plausibility and accurate inter-human contact; the multi-view images are fused into a high-fidelity texture.
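
To fix ideas, here is a minimal control-flow sketch of that three-stage claim. Every name below is a hypothetical stand-in for a component the paper names; the real HUG3D components are neural networks and an optimizer, not these placeholders.

    from dataclasses import dataclass
    from typing import List

    import numpy as np

    @dataclass
    class Person:
        normals: np.ndarray      # generated multi-view normal maps, (V, H, W, 3)
        views: np.ndarray        # generated multi-view RGB images, (V, H, W, 3)
        mesh: object = None      # geometry filled in by the HUG-GR stand-in

    def perspective_to_orthographic(image: np.ndarray) -> np.ndarray:
        """Stage 0 stand-in: canonical orthographic mapping (identity here)."""
        return image

    def hug_mvd(ortho: np.ndarray, n_people: int, n_views: int = 4) -> List[Person]:
        """Stage 1 stand-in: joint group-instance multi-view diffusion."""
        h, w, _ = ortho.shape
        return [Person(np.zeros((n_views, h, w, 3)), np.zeros((n_views, h, w, 3)))
                for _ in range(n_people)]

    def hug_gr(people: List[Person]) -> List[Person]:
        """Stage 2 stand-in: geometry optimization with physics-based priors."""
        for p in people:
            p.mesh = "optimized mesh placeholder"
        return people

    def fuse_texture(people: List[Person]) -> List[Person]:
        """Stage 3 stand-in: fuse the multi-view images into a texture."""
        return people

    def hug3d(image: np.ndarray, n_people: int) -> List[Person]:
        ortho = perspective_to_orthographic(image)   # remove perspective distortion
        people = hug_mvd(ortho, n_people)            # consistent multi-view generation
        people = hug_gr(people)                      # contact-aware geometry
        return fuse_texture(people)                  # textured output

    print(len(hug3d(np.zeros((256, 256, 3)), n_people=3)))   # -> 3

The point of the skeleton is the dependency order: the physics-prior stage consumes whatever the diffusion stage emits, which is exactly where the referee report below presses.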

What carries the argument

Human Group-Instance Multi-View Diffusion (HUG-MVD), which jointly generates multi-view normals and images by modeling both individual and group-level information to resolve occlusions and proximity artifacts.
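
One plausible realization of that instance-plus-group coupling is attention pooled across all people in the scene, so each person's view tokens can see every neighbor during denoising. The module below is an illustrative sketch of that generic pattern in PyTorch, not HUG-MVD's actual architecture.

    import torch
    import torch.nn as nn

    class GroupInstanceAttention(nn.Module):
        """Illustrative coupling layer: tokens from all instances are pooled
        into one sequence so self-attention mixes instance- and group-level
        context. A generic pattern, not the authors' design."""

        def __init__(self, dim: int = 256, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, people, tokens_per_person, dim)
            b, p, t, d = tokens.shape
            flat = tokens.reshape(b, p * t, d)       # group-wide token sequence
            out, _ = self.attn(flat, flat, flat)     # every instance attends to all others
            return self.norm(flat + out).reshape(b, p, t, d)

    # toy usage: 2 images, 3 people each, 16 tokens per person
    x = torch.randn(2, 3, 16, 256)
    print(GroupInstanceAttention()(x).shape)         # torch.Size([2, 3, 16, 256])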

If this is right

  • Reconstructions of crowded scenes become usable for AR/VR without manual cleanup of overlaps or holes.
  • Inter-human contact can be modeled directly from monocular input rather than added as post-processing.
  • Multi-view consistency is achieved in one forward pass instead of independent per-person generation.
  • Physics priors become part of the core optimization rather than an optional constraint.
  • Textured output quality improves because the diffusion stage already produces coherent multi-view appearance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The orthographic canonicalization step could be applied to other multi-object reconstruction tasks where perspective distortion currently limits accuracy.
  • Replacing the diffusion backbone with newer video or 3D diffusion models might allow the same group-instance coupling to work on dynamic interactions.
  • The physics-prior optimization could be extended to include soft-tissue deformation or clothing dynamics for more lifelike results.
  • Quantitative failure cases in extreme occlusion or unusual body configurations would reveal whether the joint diffusion truly generalizes beyond the training distribution.

Load-bearing premise

Transforming the input to canonical orthographic space and then applying joint group-instance diffusion plus explicit physics priors will remove perspective distortions and enforce realistic contacts without introducing new geometric errors or requiring extra input.
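
The premise is easiest to stress-test with a concrete form of the prior. Assuming the priors are soft penalties of the usual kind (the excerpt does not specify them), a minimal non-penetration term over sphere-approximated bodies looks like this; it is zero when bodies are separated and differentiable when they overlap, which is what lets it steer the optimization:

    import torch

    def penetration_penalty(centers_a, radii_a, centers_b, radii_b):
        """Soft non-penetration penalty between two bodies approximated by
        spheres: quadratic in the overlap depth, zero once separated.
        A generic stand-in for the paper's unspecified priors."""
        d = torch.cdist(centers_a, centers_b)                         # pairwise distances
        overlap = torch.relu(radii_a[:, None] + radii_b[None, :] - d)
        return (overlap ** 2).sum()

    # Two overlapping one-sphere "bodies", 0.3 m apart with 0.25 m radii.
    ca = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)
    cb = torch.tensor([[0.3, 0.0, 0.0]])
    ra = torch.tensor([0.25])
    rb = torch.tensor([0.25])

    loss = penetration_penalty(ca, ra, cb, rb)
    loss.backward()
    print(loss.item())   # 0.2 m of overlap -> penalty 0.04
    print(ca.grad)       # descending this gradient pushes the spheres apart

The load-bearing caveat is visible even here: the penalty only supplies a useful correction while the initial geometry is close enough that "push apart" is the right move.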

What would settle it

Quantitative comparison of reconstructed meshes against ground-truth 3D scans of known multi-person contact scenes, measuring interpenetration volume, contact-point error, and surface reconstruction accuracy in occluded regions.
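
All three proposed measurements are computable from meshes or occupancy grids. As a toy illustration (not the paper's evaluation code), interpenetration volume can be estimated by voxelizing both bodies and counting doubly occupied voxels, and surface accuracy by a symmetric Chamfer distance:

    import numpy as np

    def interpenetration_volume(occ_a, occ_b, voxel_size):
        """Volume occupied by both bodies at once (toy voxel estimator)."""
        return np.logical_and(occ_a, occ_b).sum() * voxel_size ** 3

    def chamfer(p, q):
        """Symmetric Chamfer distance between two point clouds (brute force)."""
        d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
        return d.min(axis=1).mean() + d.min(axis=0).mean()

    # Toy scene: two 0.25 m spheres with centers 0.3 m apart, 2 cm voxels.
    g = np.linspace(-0.5, 0.8, 66)
    x, y, z = np.meshgrid(g, g, g, indexing="ij")
    occ_a = x**2 + y**2 + z**2 <= 0.25**2
    occ_b = (x - 0.3)**2 + y**2 + z**2 <= 0.25**2
    print(interpenetration_volume(occ_a, occ_b, voxel_size=0.02))  # ~1.4e-2 m^3

    rng = np.random.default_rng(0)
    print(chamfer(rng.random((100, 3)), rng.random((100, 3))))

Contact-point error would additionally need ground-truth contact correspondences, which is why scanned close-interaction datasets such as Hi4D are the natural testbed.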

Figures

Figures reproduced from arXiv: 2604.05436 by Gwanghyun Kim, Jason Park, Junghun James Kim, Se Young Chun, Suh Yoon Jeon.

Figure 1. Core challenges in monocular multi-human 3D reconstruction: (a) geometric complexity and perspective distortion, (b) lack of …
Figure 2. Overview of the HUG3D framework. Given a single perspective image, (1) the …
Figure 3. Comparison of results from multi-view diffusion trained …
Figure 4. Qualitative comparison on multi-human 3D reconstruction from a single image. HUG3D outperforms baselines by correcting …
Figure 5. Qualitative results on in-the-wild images. Our method demonstrates robust multi-human reconstructions, outperforming baseline …
Figure 6. Ablation study on key components. (a) Pers2Ortho improves projection sharpness. (b) Each HUG-GR loss boosts geometric …

(Captions truncated at source; full figures available on arXiv.)
Original abstract

Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image. Project page: https://jongheean11.github.io/HUG3D_project

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HUG3D, a framework for textured 3D reconstruction of multiple interacting humans from a single image. It first maps the input to canonical orthographic space, then applies Human Group-Instance Multi-View Diffusion (HUG-MVD) to jointly generate multi-view normals and images while modeling both instance and group context to address occlusions and proximity artifacts. The Human Group-Instance Geometric Reconstruction (HUG-GR) stage then optimizes geometry using explicit physics-based interaction priors to enforce contact plausibility. Multi-view images are finally fused for texture. The central claim is that this pipeline significantly outperforms prior single-human and multi-human methods in producing high-fidelity, physically plausible results.

Significance. If validated, the work would advance single-image 3D human reconstruction by explicitly handling group interactions, a gap in existing methods that often produce overlaps or distorted contacts. The sequential design (orthographic canonicalization, joint diffusion, physics priors) offers a structured way to incorporate group-level context, which could benefit AR/VR and digital human applications. The approach builds on standard diffusion and optimization tools while adding interaction-specific components; its impact, however, hinges on whether the priors reliably correct diffusion-stage errors, which the provided description asserts but does not quantify.

major comments (2)
  1. [HUG-GR module] The claim that explicit physics-based interaction priors reliably enforce plausible contacts and resolve occlusions assumes the initial geometry from HUG-MVD is sufficiently accurate. If the priors are soft penalty terms, large initial errors (common in single-image diffusion) can trap the optimizer in local minima with residual penetrations or distorted limbs, directly undermining the central claim of physically plausible multi-human reconstructions.
  2. [Abstract and Experiments] The assertion of 'extensive experiments' showing significant outperformance over single-human and multi-human baselines is load-bearing for the contribution, yet the abstract supplies no metrics, specific baselines, ablation results, or implementation details. This prevents verification that gains are not due to post-hoc selection or that the physics priors contribute measurably beyond the diffusion stage.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by briefly stating one or two key quantitative improvements (e.g., contact error or IoU metrics) to support the outperformance claim.
  2. [Method] Notation for the orthographic transform and the precise form of the physics priors (e.g., whether they are hard constraints or weighted penalties) should be defined explicitly in the method section for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity and strengthen the presentation of our contributions.

Point-by-point responses
  1. Referee: [HUG-GR module] The claim that explicit physics-based interaction priors reliably enforce plausible contacts and resolve occlusions assumes the initial geometry from HUG-MVD is sufficiently accurate. If the priors are soft penalty terms, large initial errors (common in single-image diffusion) can trap the optimizer in local minima with residual penetrations or distorted limbs, directly undermining the central claim of physically plausible multi-human reconstructions.

    Authors: We appreciate this important observation on the sensitivity of the optimization. The HUG-GR stage formulates an energy function combining data fidelity terms (from HUG-MVD normals and images) with soft physics-based penalties for inter-human contacts and non-penetration. To reduce the risk of poor local minima, the optimization proceeds in stages: first aligning coarse poses, then refining geometry with the priors (a toy sketch of such staging appears after these responses). In the revised manuscript we have added quantitative results showing contact accuracy metrics (penetration volume and contact point error) before versus after HUG-GR, as well as an ablation disabling the physics terms. These demonstrate measurable reductions in implausible contacts. We also include a short discussion of remaining failure modes when HUG-MVD produces severe initial errors. While we cannot guarantee perfect recovery from arbitrarily bad initializations, the reported experiments indicate that the diffusion outputs are sufficiently reliable for the evaluated cases, supporting the overall claim. revision: partial

  2. Referee: [Abstract and Experiments] The assertion of 'extensive experiments' showing significant outperformance over single-human and multi-human baselines is load-bearing for the contribution, yet the abstract supplies no metrics, specific baselines, ablation results, or implementation details. This prevents verification that gains are not due to post-hoc selection or that the physics priors contribute measurably beyond the diffusion stage.

    Authors: We agree that the abstract would benefit from concrete numbers to make the claims immediately verifiable. In the revised version we have updated the abstract to report key quantitative gains, including reductions in penetration depth and improvements in normal consistency and perceptual metrics relative to both single-human and multi-human baselines. The experiments section already details all baselines (single-human methods such as PIFuHD and multi-human approaches), the full set of metrics (Chamfer distance, contact plausibility, etc.), component ablations isolating the contribution of the group-instance diffusion and the physics priors, and implementation hyperparameters. We have further clarified the evaluation protocol to confirm that comparisons were performed uniformly without post-hoc selection. These additions directly address the verifiability concern. revision: yes
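
The staging described in response 1 (coarse alignment first, physics terms switched on later) is a standard schedule for avoiding bad local minima. A self-contained toy version, with 2D circles standing in for bodies and noisy targets standing in for the HUG-MVD evidence, shows the mechanism; everything here is illustrative, not the authors' optimizer:

    import torch

    # Two circles are pulled toward overlapping target positions (a stand-in
    # for imperfect diffusion evidence); after 300 steps a soft non-overlap
    # penalty is switched on, mimicking the coarse-then-refine schedule.
    targets = torch.tensor([[0.00, 0.0], [0.15, 0.0]])   # only 0.15 m apart
    centers = torch.tensor([[-0.40, 0.0], [0.55, 0.0]], requires_grad=True)
    radius = 0.25                                        # contact distance is 0.5 m
    opt = torch.optim.Adam([centers], lr=0.01)

    for step in range(600):
        opt.zero_grad()
        loss = ((centers - targets) ** 2).sum()          # fidelity to the evidence
        if step >= 300:                                  # stage 2: physics prior on
            d = (centers[0] - centers[1]).norm()
            loss = loss + 10.0 * torch.relu(2 * radius - d) ** 2
        loss.backward()
        opt.step()

    print((centers[0] - centers[1]).norm().item())       # ~0.48: near contact, no overlap

The same toy also exposes the referee's worry: if the evidence term pinned the circles far from any plausible contact, no soft penalty weight would recover a realistic configuration without distorting the fit elsewhere.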

Circularity Check

0 steps flagged

No significant circularity; pipeline is modular and externally grounded

full rationale

The provided abstract and description outline a sequential pipeline: canonical orthographic transform, followed by HUG-MVD joint diffusion for multi-view normals/images, then HUG-GR optimization using physics-based priors, and final texture fusion. No equations, fitted parameters, or self-citations are quoted that reduce any output (e.g., reconstructed geometry or contacts) to an input by construction. The central claims rest on empirical comparisons to baselines rather than tautological redefinitions or load-bearing self-references. This matches the default expectation of non-circularity for a methods paper describing standard diffusion-plus-optimization stages.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5562 in / 1049 out tokens · 61338 ms · 2026-05-10T18:52:02.468299+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 7 canonical work pages · 3 internal anchors

  1. Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, and Thomas Lucas. Multi-HMR: Multi-person whole-body human mesh recovery in a single shot. ECCV, 2024.
  2. Amir Barda, Matheus Gadelha, Vladimir G. Kim, Noam Aigerman, Amit H. Bermano, and Thibault Groueix. Instant3dit: Multiview inpainting for fast editing of 3D objects. arXiv preprint arXiv:2412.00518, 2024.
  3. Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. ECCV, 2016.
  4. Chenjie Cao, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. MVInpainter: Learning multi-view consistent inpainting to bridge 2D and 3D editing. arXiv preprint arXiv:2408.08000, 2024.
  5. Junuk Cha, Hansol Lee, Jaewon Kim, Nhat Nguyen Bao Truong, Jaeshin Yoon, and Seungryul Baek. 3D reconstruction of interacting multi-person in clothing from a single image. WACV, 2024.
  6. Kennard Chan, Fayao Liu, Guosheng Lin, Chuan-Sheng Foo, and Weisi Lin. Robust-PIFu: Robust pixel-aligned implicit function for 3D human digitalization from a single image. ICLR, 2024.
  7. Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. RetinaFace: Single-shot multi-level face localisation in the wild. CVPR, 2020.
  8. Mihai Fieraru, Mihai Zanfir, Teodor Szente, Eduard Bazavan, Vlad Olaru, and Cristian Sminchisescu. REMIPS: Physically consistent 3D reconstruction of multiple interacting people under weak supervision. NeurIPS, 2021.
  9. Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3D scenes by learning human-scene interaction. CVPR, 2021.
  10. Hsuan-I Ho, Lixin Xue, Jie Song, and Otmar Hilliges. Learning locally editable virtual humans. CVPR, 2023.
  11. Hsuan-I Ho, Jie Song, Otmar Hilliges, et al. SiTH: Single-view textured human reconstruction with image-conditioned diffusion. CVPR, 2024.
  12. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. NeurIPS, 2021.
  13. Buzhen Huang, Chen Li, Chongyang Xu, Liang Pan, Yangang Wang, and Gim Hee Lee. Closely interactive human reconstruction with proxemics and physics-guided adaption. CVPR, 2024.
  14. Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, and Jie Song. MultiPly: Reconstruction of multiple people from monocular video in the wild. CVPR, 2024.
  15. Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. CVPR, 2018.
  16. Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. CVPR, 2018.
  17. Rahima Khanam and Muhammad Hussain. YOLOv11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.
  18. Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. ECCV, 2024.
  19. Byungjun Kim, Patrick Kwon, Kwangho Lee, Myunggi Lee, Sookwan Han, Daesik Kim, and Hanbyul Joo. Chupa: Carving 3D clothed humans from skinned shape priors using 2D diffusion probabilistic models. ICCV, 2023.
  20. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  21. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. ICCV, 2023.
  22. Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. CVPR, 2020.
  23. Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
  24. Inhee Lee, Byungjun Kim, and Hanbyul Joo. Guess the unseen: Dynamic 3D scene reconstruction from partial 2D glimpses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1062–1071, 2024.
  25. Jiefeng Li, Can Wang, Hao Zhu, Yihuan Mao, Hao-Shu Fang, and Cewu Lu. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. CVPR, 2019.
  26. Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wei Xue, Wenhan Luo, et al. Era3D: High-resolution multiview diffusion using efficient row-wise attention. NeurIPS, 2024.
  27. Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Mengfei Li, Xiaowei Chi, Siyu Xia, Wei Xue, et al. PSHuman: Photorealistic single-view human reconstruction using cross-scale diffusion. arXiv preprint arXiv:2409.10141, 2024.
  28. Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 2015.
  29. Shugao Ma, Tomas Simon, Jason Saragih, Dawei Wang, Yuecheng Li, Fernando De La Torre, and Yaser Sheikh. Pixel codec avatars. CVPR, 2021.
  30. Lea Müller, Vickie Ye, Georgios Pavlakos, Michael Black, and Angjoo Kanazawa. Generative proxemics: A prior for 3D social interaction from images. CVPR, 2024.
  31. Armin Mustafa, Akin Caliskan, Lourdes Agapito, and Adrian Hilton. Multi-person implicit reconstruction from a single image. CVPR, 2021.
  32. Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L. Davidson, Sameh Khamis, Mingsong Dou, et al. Holoportation: Virtual 3D teleportation in real-time. ACM Symposium on User Interface Software and Technology, pages 741–754, 2016.
  33. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  34. Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. CVPR, 2019.
  35. Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, et al. LHM: Large animatable human reconstruction model from a single image in seconds. arXiv preprint arXiv:2503.10625, 2025.
  36. Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
  37. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CVPR, 2022.
  38. Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. ICCV, 2019.
  39. Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. CVPR, 2020.
  40. Yu Sun, Wu Liu, Qian Bao, Yili Fu, Tao Mei, and Michael J. Black. Putting people in their place: Monocular regression of 3D people in depth. CVPR, 2022.
  41. Patrick von Platen, Luke Lewis, and Thomas Wolf. Diffusers: State-of-the-art diffusion models, 2022.
  42. Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, and Chen Change Loy. MEAT: Multiview diffusion model for human generation on megapixels with mesh attention. CVPR, 2025.
  43. Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
  44. Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, and Michael J. Black. ICON: Implicit clothed humans obtained from normals. CVPR, 2022.
  45. Yuliang Xiu, Jinlong Yang, Xu Cao, Dimitrios Tzionas, and Michael J. Black. ECON: Explicit clothed humans optimized via normal integration. CVPR, 2023.
  46. Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose: Simple vision transformer baselines for human pose estimation. NeurIPS, 2022.
  47. Yifei Yin, Chen Guo, Manuel Kaufmann, Juan Jose Zarate, Jie Song, and Otmar Hilliges. Hi4D: 4D instance segmentation of close human interaction. CVPR, 2023. License: non-commercial academic use only; see https://hi4d.ait.ethz.ch/.
  48. Tao Yu, Zerong Zheng, Kaiwen Guo, Pengpeng Liu, Qionghai Dai, and Yebin Liu. Function4D: Real-time human volumetric capture from very sparse consumer RGBD sensors. CVPR, 2021.
  49. Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. ICCV, 2023.
  50. Zechuan Zhang, Zongxin Yang, and Yi Yang. SIFU: Side-view conditioned implicit function for real-world usable clothed human reconstruction. CVPR, 2024.
  51. Yang Zheng, Ruizhi Shao, Yuxiang Zhang, Tao Yu, Zerong Zheng, Qionghai Dai, and Yebin Liu. DeepMultiCap: Performance capture of multiple characters using sparse multiview cameras. ICCV, 2021.
  52. Shangchen Zhou, Kelvin Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. NeurIPS, 2022.