pith. machine review for the scientific record.

arxiv: 2604.17818 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion

C. Karen Liu, Ehsan Adeli, Heng Yu, Hongjie Li, Hong-Xing Yu, Jiajun Wu, Jiaman Li

Pith reviewed 2026-05-10 05:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D motion reconstruction · human-object interaction · diffusion models · internet videos · multi-view synthesis · 2D to 3D lifting · human pose estimation · dynamic camera

The pith

A two-stage 2D diffusion framework reconstructs globally consistent 3D human motion and object interactions from Internet videos by synthesizing multi-view training data from 2D keypoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to scale 3D motion capture beyond small MoCap datasets by pulling diverse actions straight from web videos. It first extracts 2D keypoints from those videos and uses them to build synthetic multi-view 2D sequences that include rare motions such as gymnastics. A camera-conditioned diffusion model is then trained on this synthetic data to lift the 2D inputs into coherent 3D trajectories in world space, including human-object contacts. The approach is demonstrated on dynamic-camera footage and in-the-wild interaction clips, where it produces more realistic results than earlier lifting methods.
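To make the stage-1 data flow concrete, here is a minimal numpy sketch of the reprojection idea: project a rough world-space motion into several sampled cameras to obtain paired multi-view 2D sequences. Everything here (the function names, the circular camera rig, the toy motion) is an illustrative assumption, not the paper's implementation.

```python
# Minimal sketch of the stage-1 reprojection idea, assuming a pinhole
# camera model and a hypothetical circular camera rig; none of these
# names come from the paper's code.
import numpy as np

def project(joints_3d: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project (T, J, 3) world-space joints into a camera with intrinsics K
    and extrinsics (R, t); returns (T, J, 2) pixel coordinates."""
    cam = joints_3d @ R.T + t              # world -> camera coordinates
    proj = cam @ K.T                       # apply intrinsics
    return proj[..., :2] / proj[..., 2:3]  # perspective divide

def synthesize_views(joints_3d, K, n_views=4, radius=3.0, seed=0):
    """Reproject one motion into several look-at cameras sampled on a circle
    around the subject, yielding paired multi-view 2D sequences."""
    rng = np.random.default_rng(seed)
    views = []
    for theta in rng.uniform(0.0, 2.0 * np.pi, n_views):
        pos = np.array([radius * np.cos(theta), 0.0, radius * np.sin(theta)])
        z = -pos / np.linalg.norm(pos)              # forward axis, toward origin
        x = np.cross(np.array([0.0, 1.0, 0.0]), z)  # right axis
        x /= np.linalg.norm(x)
        y = np.cross(z, x)                          # up axis
        R = np.stack([x, y, z])                     # world -> camera rotation
        t = -R @ pos
        views.append(project(joints_3d, K, R, t))
    return views

K = np.array([[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]])
motion = np.random.default_rng(1).normal(scale=0.3, size=(16, 22, 3))  # toy 16-frame, 22-joint clip
print([v.shape for v in synthesize_views(motion, K)])  # 4 views of (16, 22, 2)
```

In the paper's own pipeline this reprojection is one half of a hybrid data strategy (see Figure 2), combined with global 2D pose sequences taken directly from videos.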

Core claim

AnyLift is a two-stage pipeline that first synthesizes domain-specific multi-view 2D motion data from 2D keypoints extracted from Internet videos, then trains a camera-conditioned multi-view 2D motion diffusion model on that data to recover 3D human motion and 3D human-object interactions in world space.

What carries the argument

Camera-conditioned multi-view 2D motion diffusion model trained on synthetic data generated from 2D keypoints extracted from Internet videos.

If this is right

  • The method recovers motions such as gymnastics that are missing from standard motion-capture collections.
  • It produces coherent 3D human-object interaction geometry from ordinary in-the-wild videos.
  • Global consistency under dynamic camera motion improves over prior single-view or static-camera approaches.
  • Large-scale 3D human-behavior datasets can be assembled directly from existing Internet video archives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same keypoint-to-synthetic-data step could be reused for other articulated objects once reliable 2D detectors exist for them.
  • Combining the lifted 3D output with existing video-generation models might allow text-to-3D animation pipelines that respect physical contact.
  • Keypoint errors on heavily occluded or low-resolution clips remain a practical failure mode that would require additional robustness measures.

Load-bearing premise

That 2D keypoints taken from Internet videos are accurate and complete enough to let the synthetic multi-view data train a diffusion model that still works on real, noisy footage with moving cameras.

What would settle it

Apply the method to Internet videos that also have independent ground-truth 3D motion capture of the same performance and measure whether the reconstructed joint positions and object trajectories match the ground truth within acceptable error bounds.
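The joint-position half of that test reduces to the standard MPJPE metric; a minimal sketch, assuming matched (T, J, 3) world-space arrays in meters with the same frame and joint order:

```python
# Minimal MPJPE sketch for the ground-truth comparison described above,
# assuming reconstruction and MoCap are aligned (T, J, 3) arrays in meters.
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean joint distance over all frames and joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

rng = np.random.default_rng(0)
gt = rng.normal(size=(120, 22, 3))                 # toy ground-truth clip
pred = gt + rng.normal(scale=0.05, size=gt.shape)  # toy reconstruction
print(f"MPJPE: {mpjpe(pred, gt) * 1000:.1f} mm")   # about 80 mm for this toy noise level
```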

Figures

Figures reproduced from arXiv: 2604.17818 by C. Karen Liu, Ehsan Adeli, Heng Yu, Hongjie Li, Hong-Xing Yu, Jiajun Wu, Jiaman Li.

Figure 1. Human and human-object interaction (HOI) motions lifted by our approach. Trained on 2D keypoints and corresponding camera trajectories, our framework AnyLift reconstructs world-coordinated 3D human motion and HOI from monocular videos captured by dynamic cameras. We demonstrate its effectiveness on human motion reconstruction from Internet gymnastics videos (left) and on HOI reconstruction from captured re…
Figure 2. Overview of AnyLift. (a) We first train a single-view 2D motion diffusion model conditioned on camera trajectories and epipolar lines to synthesize multi-view 2D training data. (b) During training, we employ a hybrid data source strategy that enhances viewpoint coverage by combining global 2D pose sequences from videos with locally reprojected poses. (c) Finally, we train a multi-view 2D motion diffusion m…
Figure 3. Qualitative comparison of human motion reconstruction on our collected Internet videos. AnyLift produces more plausible motions, mitigating the root trajectory errors, inaccurate local body pose, and self-penetration artifacts observed in baselines. (Panels: Input Video and 2D Pose; SMPLify; VisTracker; AnyLift (Ours).)
Figure 4. Qualitative comparison of HOI reconstruction on the BEHAVE [1] dataset. We show results on two object categories, chair and table. AnyLift produces coherent and physically plausible human-object interactions with accurate contact and minimal penetration.
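Figure 2 notes that the single-view diffusion model is conditioned on epipolar lines. For readers unfamiliar with that signal, a minimal sketch of computing an epipolar line from a relative camera pose follows; the intrinsics and pose here are toy assumptions, not values from the paper.

```python
# Epipolar line from a relative pose: a point seen in view 1 constrains its
# match in view 2 to the line l2 = F @ x1, with F the fundamental matrix.
# Camera values are illustrative assumptions.
import numpy as np

def fundamental_matrix(K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """F mapping homogeneous pixels in view 1 to epipolar lines in view 2,
    assuming shared intrinsics K and x2_cam = R @ x1_cam + t."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])  # skew-symmetric [t]x
    Kinv = np.linalg.inv(K)
    return Kinv.T @ tx @ R @ Kinv

K = np.array([[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]])
R = np.eye(3)                            # toy relative pose:
t = np.array([0.5, 0.0, 0.0])            # pure horizontal baseline
F = fundamental_matrix(K, R, t)
x1 = np.array([600.0, 400.0, 1.0])       # keypoint in view 1 (homogeneous)
a, b, c = F @ x1                         # line a*x + b*y + c = 0 in view 2
print(f"epipolar line in view 2: {a:.4f} x + {b:.4f} y + {c:.4f} = 0")
```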
Original abstract

Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AnyLift, a two-stage framework for reconstructing 3D human motion and human-object interactions (HOI) from Internet videos. Stage 1 synthesizes multi-view 2D motion sequences by leveraging 2D keypoints extracted from Internet videos to incorporate rare motions absent from MoCap datasets. Stage 2 trains a camera-conditioned multi-view 2D motion diffusion model on the resulting domain-specific synthetic data to recover globally consistent 3D motion and HOI in world space. The authors claim the approach outperforms prior work on challenging gymnastics motions and in-the-wild HOI videos.

Significance. If the central claims are substantiated, the work could enable scalable construction of large 3D motion datasets from abundant Internet video sources, addressing the limited diversity and coverage of traditional MoCap data for underrepresented actions such as gymnastics and complex HOI. The two-stage 2D-diffusion strategy offers a potential path to handling dynamic cameras without requiring multi-view captures at inference time.

major comments (2)
  1. [Abstract] The claim that the method 'outperforms prior work' on gymnastics and in-the-wild HOI videos is presented without any quantitative metrics, ablation studies, error bars, or validation details, leaving the central empirical claim without visible supporting evidence.
  2. [Method (data synthesis stage)] First-stage synthesis, as described in the method overview, rests on the assumption that noisy single-view 2D keypoints extracted from Internet videos can be turned into multi-view 2D sequences clean and diverse enough to match real-world motion and camera distributions; this assumption is load-bearing for the entire pipeline. Any systematic bias from motion blur, truncation, or depth ambiguity would be baked into the diffusion model's training set, yet no isolating experiment is referenced (e.g., 2D reprojection error of synthesized views against held-out multi-view captures, or an ablation replacing synthetic data with clean MoCap; a minimal version of such a check is sketched after this report).
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence summarizing the quantitative metrics used to demonstrate outperformance.
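A minimal version of the isolating experiment proposed in major comment 2 might look like the following sketch: project synthesized 3D poses into a held-out calibrated camera and score the mean pixel distance to that view's observed keypoints. The calibration values and toy data are assumptions for illustration only.

```python
# Sketch of the proposed check: 2D reprojection error (pixels) of
# synthesized poses against a held-out calibrated view. Camera values
# and data here are illustrative assumptions.
import numpy as np

def to_pixels(joints_3d, K, R, t):
    """Project (T, J, 3) world-space joints into a camera; returns (T, J, 2)."""
    cam = joints_3d @ R.T + t
    proj = cam @ K.T
    return proj[..., :2] / proj[..., 2:3]

def reprojection_error_px(synth_3d, K, R, t, observed_2d):
    """Mean pixel distance between reprojected synthesis and held-out keypoints."""
    return float(np.linalg.norm(to_pixels(synth_3d, K, R, t) - observed_2d, axis=-1).mean())

rng = np.random.default_rng(0)
K = np.array([[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 4.0])                # held-out camera, toy calibration
gt_3d = rng.normal(scale=0.3, size=(60, 22, 3))            # motion the rig actually captured
observed = to_pixels(gt_3d, K, R, t)                       # its detected 2D keypoints (idealized)
synth = gt_3d + rng.normal(scale=0.01, size=gt_3d.shape)   # imperfect stage-1 synthesis
print(f"reprojection error: {reprojection_error_px(synth, K, R, t, observed):.2f} px")
```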

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment point by point below, providing clarifications from the full paper and outlining revisions where they strengthen the work.

Point-by-point responses
  1. Referee: [Abstract] The claim that the method 'outperforms prior work' on gymnastics and in-the-wild HOI videos is presented without any quantitative metrics, ablation studies, error bars, or validation details, leaving the central empirical claim without visible supporting evidence.

    Authors: We acknowledge that the abstract is a high-level summary and does not include specific numerical values. The full manuscript provides quantitative comparisons with prior work (using metrics such as MPJPE and PA-MPJPE; a minimal PA-MPJPE computation is sketched after these responses), ablations, and error analysis in Section 4 (Experiments), supported by figures and tables on both gymnastics and HOI sequences. To make the central claim more self-contained in the abstract, we will revise it to briefly reference the key quantitative improvements while retaining its concise nature. revision: partial

  2. Referee: [Method (data synthesis stage)] First-stage synthesis, as described in the method overview, rests on the assumption that noisy single-view 2D keypoints extracted from Internet videos can be turned into multi-view 2D sequences clean and diverse enough to match real-world motion and camera distributions; this assumption is load-bearing for the entire pipeline. Any systematic bias from motion blur, truncation, or depth ambiguity would be baked into the diffusion model's training set, yet no isolating experiment is referenced (e.g., 2D reprojection error of synthesized views against held-out multi-view captures, or an ablation replacing synthetic data with clean MoCap).

    Authors: We agree that direct validation of the first-stage synthesis is important to substantiate the pipeline. The manuscript already includes qualitative visualizations of the synthesized multi-view 2D sequences and demonstrates their impact through end-to-end 3D reconstruction results. To isolate potential biases, we will add an ablation study in the revision that compares training the diffusion model on the synthetic data versus clean MoCap data, along with 2D reprojection error metrics on held-out captures where available. revision: yes
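Since the rebuttal leans on MPJPE and PA-MPJPE, a minimal PA-MPJPE computation is sketched below: each predicted frame is similarity-aligned to ground truth (Procrustes: rotation, translation, scale) before the joint error is measured, so the metric isolates local pose quality from global placement. This is a generic implementation, not the authors' evaluation code.

```python
# Minimal PA-MPJPE sketch: per-frame Procrustes (similarity) alignment of
# prediction to ground truth, then mean joint error. Generic, illustrative code.
import numpy as np

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, J, 3). Mean joint error after per-frame alignment."""
    errs = []
    for p, g in zip(pred, gt):
        p0, g0 = p - p.mean(0), g - g.mean(0)           # center both joint sets
        U, S, Vt = np.linalg.svd(p0.T @ g0)             # SVD of cross-covariance
        D = np.eye(3)
        D[2, 2] = np.sign(np.linalg.det(U @ Vt))        # guard against reflections
        Rm = Vt.T @ D @ U.T                             # optimal rotation (Kabsch)
        s = np.trace(np.diag(S) @ D) / (p0 ** 2).sum()  # optimal isotropic scale
        aligned = s * p0 @ Rm.T + g.mean(0)
        errs.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errs))

rng = np.random.default_rng(0)
gt = rng.normal(size=(60, 22, 3))
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = 1.2 * gt @ Rz.T + 0.5 + rng.normal(scale=0.05, size=gt.shape)  # rotated, scaled, shifted
print(f"PA-MPJPE: {pa_mpjpe(pred, gt) * 1000:.1f} mm")  # alignment removes the global transform
```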

Circularity Check

0 steps flagged

No significant circularity; two-stage pipeline uses external keypoint extraction and independent synthetic data generation.

full rationale

The described method extracts 2D keypoints from internet videos (external process), synthesizes multi-view 2D sequences as training data, and trains a separate camera-conditioned diffusion model to lift to 3D. No equations or steps reduce the final 3D output to the input by construction, no fitted parameters are relabeled as predictions, and no self-citation chains or uniqueness theorems are invoked to force the architecture. The pipeline remains open to external validation via held-out multi-view data or ablations, consistent with a non-circular empirical approach.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that 2D keypoints from arbitrary internet videos provide enough signal to synthesize training data that captures underrepresented 3D motions and interactions.

axioms (1)
  • domain assumption: 2D keypoints extracted from internet videos can be used to synthesize accurate multi-view 2D motion data for rare motions.
    Invoked in the first stage to incorporate human motions that rarely appear in existing MoCap datasets.

pith-pipeline@v0.9.0 · 5518 in / 1219 out tokens · 39651 ms · 2026-05-10T05:34:52.988404+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild

cs.CV · 2026-05 · unverdicted · novelty 7.0

    SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdo...

Reference graph

Works this paper leans on

54 extracted references · 2 canonical work pages · cited by 1 Pith paper

  1. [1] Bharat Lal Bhatnagar, Xianghui Xie, Ilya A. Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHAVE: Dataset and method for tracking human object interactions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  2. [2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision (ECCV), 2016.
  3. [3] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In International Conference on Computer Vision (ICCV), 2019.
  4. [4] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  5. [5] Chengan He, Jun Saito, James Zachary, Holly Rushmeier, and Yi Zhou. NeMF: Neural motion fields for kinematic animation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  6. [6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  7. [7] Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction from multi-view RGB-D images. International Journal of Computer Vision (IJCV), 132(7):2551–2566, 2024.
  8. [8] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  9. [9] Roy Kapon, Guy Tevet, Daniel Cohen-Or, and Amit H. Bermano. MAS: Multi-view ancestral sampling for 3D motion generation using 2D diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  10. [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  11. [11] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  12. [12] Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. PARE: Part attention regressor for 3D human body estimation. In International Conference on Computer Vision (ICCV), 2021.
  13. [13] Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. PACE: Human and camera motion estimation from in-the-wild videos. In International Conference on 3D Vision (3DV), 2024.
  14. [14] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In International Conference on Computer Vision (ICCV), 2019.
  15. [15] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
  16. [16] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. HybrIK: A hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  17. [17] Jiaman Li, Jiajun Wu, and C. Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6), 2023.
  18. [18] Jiaman Li, C. Karen Liu, and Jiajun Wu. Lifting motion to the 3D world via 2D diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
  19. [19] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. AI Choreographer: Music conditioned 3D dance generation with AIST++. In International Conference on Computer Vision (ICCV), 2021.
  20. [20] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. MHFormer: Multi-hypothesis transformer for 3D human pose estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  21. [21] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
  22. [22] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In International Conference on Computer Vision (ICCV), 2023.
  23. [23] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. In International Conference on Learning Representations (ICLR), 2024.
  24. [24] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3D: Single image to 3D using cross-domain diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  25. [25] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015.
  26. [26] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision (ICCV), 2019.
  27. [27] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. In International Conference on Computer Vision (ICCV), 2017.
  28. [28] Tuan Duc Ngo, Peiye Zhuang, Chuang Gan, Evangelos Kalogerakis, Sergey Tulyakov, Hsin-Ying Lee, and Chaoyang Wang. DELTA: Dense efficient long-range 3D tracking for any video. In International Conference on Learning Representations (ICLR), 2025.
  29. [29] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  30. [30] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  31. [31] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In International Conference on Learning Representations (ICLR), 2023.
  32. [32] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
  33. [33] Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, and Yi Yang. Global-to-local modeling for video-based 3D human pose and shape estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  34. [34] Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In ACM SIGGRAPH Asia Conference Proceedings, 2024.
  35. [35] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.
  36. [36] Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J. Black. WHAM: Reconstructing world-grounded humans with accurate 3D motion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  37. [37] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J. Black. TRACE: 5D temporal regression of avatars with dynamic cameras in 3D environments. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  38. [38] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. In International Conference on Learning Representations (ICLR), 2023.
  39. [39] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision (ECCV), 2024.
  40. [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  41. [41] Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, and Hongsheng Li. Encoder-decoder with multi-level attention for 3D human shape and pose estimation. In International Conference on Computer Vision (ICCV), 2021.
  42. [42] Bastian Wandt, James J. Little, and Helge Rhodin. ElePose: Unsupervised 3D human pose estimation by predicting camera elevation and learning normalizing flows on 2D poses. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  43. [43] Jingbo Wang, Sijie Yan, Yuanjun Xiong, and Dahua Lin. Motion guided 3D pose estimation from videos. In European Conference on Computer Vision (ECCV), 2020.
  44. [44] Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. CHORE: Contact, human and object reconstruction from a single RGB image. In European Conference on Computer Vision (ECCV), 2022.
  45. [45] Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Visibility aware human-object interaction tracking from single RGB camera. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  46. [46] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose: Simple vision transformer baselines for human pose estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  47. [47] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  48. [48] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. GLAMR: Global occlusion-aware human mesh recovery with dynamic cameras. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  49. [49] Hongwen Zhang, Yating Tian, Yuxiang Zhang, Mengcheng Li, Liang An, Zhenan Sun, and Yebin Liu. PyMAF-X: Towards well-aligned full-body model regression from monocular images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(10):12287–12303, 2023.
  50. [50] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  51. [51] Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. NeuralDome: A neural modeling pipeline on multi-view human-object interactions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  52. [52] Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I'M HOI: Inertia-aware monocular capture of 3D human-object interactions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  53. [53] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.