AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
Pith reviewed 2026-05-10 05:34 UTC · model grok-4.3
The pith
A two-stage 2D diffusion framework reconstructs globally consistent 3D human motion and object interactions from Internet videos by synthesizing multi-view training data from 2D keypoints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnyLift is a two-stage pipeline that first synthesizes domain-specific multi-view 2D motion data from 2D keypoints extracted from Internet videos, then trains a camera-conditioned multi-view 2D motion diffusion model on that data to recover 3D human motion and 3D human-object interactions in world space.
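The geometric intuition behind the second stage is that consistent 2D joint trajectories seen from several known camera poses pin down the 3D motion. A minimal sketch of that constraint using classical DLT triangulation with hypothetical toy cameras (the paper's diffusion model generates the multi-view 2D data rather than triangulating, so this only illustrates why multi-view 2D observations determine 3D):

```python
import numpy as np

def triangulate_dlt(projs, points2d):
    """Recover one 3D point from two or more views via the direct linear transform."""
    rows = []
    for P, (x, y) in zip(projs, points2d):
        rows.append(x * P[2] - P[0])   # each 2D observation gives two linear constraints
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                          # null-space direction = homogeneous 3D point
    return X[:3] / X[3]

# Two hypothetical normalized cameras observing the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # reference camera
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # 1 m baseline on x
X_true = np.array([0.0, 0.0, 5.0, 1.0])
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
X_hat = triangulate_dlt([P1, P2], obs)
```

With noise-free observations the recovered point matches the ground truth; the interesting regime for this paper is when the "views" are themselves synthesized and noisy.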
What carries the argument
Camera-conditioned multi-view 2D motion diffusion model trained on synthetic data generated from 2D keypoints extracted from Internet videos.
If this is right
- The method recovers motions such as gymnastics that are missing from standard motion-capture collections.
- It produces coherent 3D human-object interaction geometry from ordinary in-the-wild videos.
- Global consistency under dynamic camera motion improves over prior single-view or static-camera approaches.
- Large-scale 3D human-behavior datasets can be assembled directly from existing Internet video archives.
Where Pith is reading between the lines
- The same keypoint-to-synthetic-data step could be reused for other articulated objects once reliable 2D detectors exist for them.
- Combining the lifted 3D output with existing video-generation models might allow text-to-3D animation pipelines that respect physical contact.
- Keypoint errors on heavily occluded or low-resolution clips remain a practical failure mode that would require additional robustness measures.
Load-bearing premise
That 2D keypoints taken from Internet videos are accurate and complete enough to let the synthetic multi-view data train a diffusion model that still works on real, noisy footage with moving cameras.
What would settle it
Apply the method to Internet videos that also have independent ground-truth 3D motion capture of the same performance and measure whether the reconstructed joint positions and object trajectories match the ground truth within acceptable error bounds.
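In practice such a comparison reduces to standard pose-error metrics. A minimal sketch of world-space MPJPE (mean per-joint position error) with hypothetical array shapes; the same computation applied to object keypoints would cover object trajectories:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error in world space.

    pred, gt: (frames, joints, 3) joint trajectories, in metres.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Hypothetical check: a reconstruction uniformly offset by 5 cm along x.
gt = np.zeros((10, 24, 3))
pred = gt + np.array([0.05, 0.0, 0.0])
err_m = mpjpe(pred, gt)   # ≈ 0.05 m
```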
Original abstract
Reconstructing 3D human motion and human-object interactions (HOI) from Internet videos is a fundamental step toward building large-scale datasets of human behavior. Existing methods struggle to recover globally consistent 3D motion under dynamic cameras, especially for motion types underrepresented in current motion-capture datasets, and face additional difficulty recovering coherent human-object interactions in 3D. We introduce a two-stage framework leveraging 2D diffusion that reconstructs 3D human motion and HOI from Internet videos. In the first stage, we synthesize multi-view 2D motion data for each domain, leveraging 2D keypoints extracted from Internet videos to incorporate human motions that rarely appear in existing MoCap datasets. In the second stage, a camera-conditioned multi-view 2D motion diffusion model is trained on the domain-specific synthetic data to recover 3D human motion and 3D HOI in the world space. We demonstrate the effectiveness of our method on Internet videos featuring challenging motions such as gymnastics, as well as in-the-wild HOI videos, and show that it outperforms prior work in producing realistic human motion and human-object interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AnyLift, a two-stage framework for reconstructing 3D human motion and human-object interactions (HOI) from Internet videos. Stage 1 synthesizes multi-view 2D motion sequences by leveraging 2D keypoints extracted from Internet videos to incorporate rare motions absent from MoCap datasets. Stage 2 trains a camera-conditioned multi-view 2D motion diffusion model on the resulting domain-specific synthetic data to recover globally consistent 3D motion and HOI in world space. The authors claim the approach outperforms prior work on challenging gymnastics motions and in-the-wild HOI videos.
Significance. If the central claims are substantiated, the work could enable scalable construction of large 3D motion datasets from abundant Internet video sources, addressing the limited diversity and coverage of traditional MoCap data for underrepresented actions such as gymnastics and complex HOI. The two-stage 2D-diffusion strategy offers a potential path to handling dynamic cameras without requiring multi-view captures at inference time.
major comments (2)
- [Abstract] The claim that the method 'outperforms prior work' on gymnastics and in-the-wild HOI videos is presented without any quantitative metrics, ablation studies, error bars, or validation details. This absence leaves the central empirical claim without visible supporting evidence.
- [Method (data synthesis stage)] First-stage synthesis (described in the method overview): the assumption that noisy single-view 2D keypoints extracted from Internet videos can be turned into sufficiently clean and diverse multi-view 2D sequences whose statistics match real-world motion and camera distributions is load-bearing for the entire pipeline. Any systematic bias from motion blur, truncation, or depth ambiguity would be baked into the training set for the diffusion model, yet no isolating experiment (e.g., 2D reprojection error of synthesized views against held-out multi-view captures, or ablation replacing synthetic data with clean MoCap) is referenced.
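The isolating experiment suggested above (2D reprojection error of synthesized views against held-out captures) can be sketched with a standard pinhole projection; the camera parameters and joint arrays below are hypothetical, not values from the paper:

```python
import numpy as np

def reprojection_error(joints3d, K, R, t, keypoints2d):
    """Mean pixel distance between projected 3D joints and 2D detections.

    joints3d: (J, 3) world-space joints; K: 3x3 intrinsics;
    R, t: world-to-camera rotation (3x3) and translation (3,);
    keypoints2d: (J, 2) detected keypoints in pixels.
    """
    cam = joints3d @ R.T + t        # world -> camera frame
    uv = cam @ K.T                  # pinhole projection (homogeneous)
    uv = uv[:, :2] / uv[:, 2:3]     # perspective divide
    return float(np.linalg.norm(uv - keypoints2d, axis=-1).mean())

# Hypothetical camera: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
joints = np.array([[0.0, 0.0, 5.0]])     # one joint, 5 m in front of the camera
detections = np.array([[640.0, 360.0]])  # exactly where that joint projects
err_px = reprojection_error(joints, K, np.eye(3), np.zeros(3), detections)
```

Averaged over held-out multi-view captures, this error would directly measure how faithful the synthesized views are before any diffusion training.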
minor comments (1)
- [Abstract] The abstract would be strengthened by a single sentence summarizing the quantitative metrics used to demonstrate outperformance.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment point by point below, providing clarifications from the full paper and outlining revisions where they strengthen the work.
Point-by-point responses
-
Referee: [Abstract] The claim that the method 'outperforms prior work' on gymnastics and in-the-wild HOI videos is presented without any quantitative metrics, ablation studies, error bars, or validation details. This absence leaves the central empirical claim without visible supporting evidence.
Authors: We acknowledge that the abstract is a high-level summary and does not include specific numerical values. The full manuscript provides quantitative comparisons with prior work (using metrics such as MPJPE and PA-MPJPE), ablations, and error analysis in Section 4 (Experiments), supported by figures and tables on both gymnastics and HOI sequences. To make the central claim more self-contained in the abstract, we will revise it to briefly reference the key quantitative improvements while retaining its concise nature. revision: partial
-
Referee: [Method (data synthesis stage)] First-stage synthesis (described in the method overview): the assumption that noisy single-view 2D keypoints extracted from Internet videos can be turned into sufficiently clean and diverse multi-view 2D sequences whose statistics match real-world motion and camera distributions is load-bearing for the entire pipeline. Any systematic bias from motion blur, truncation, or depth ambiguity would be baked into the training set for the diffusion model, yet no isolating experiment (e.g., 2D reprojection error of synthesized views against held-out multi-view captures, or ablation replacing synthetic data with clean MoCap) is referenced.
Authors: We agree that direct validation of the first-stage synthesis is important to substantiate the pipeline. The manuscript already includes qualitative visualizations of the synthesized multi-view 2D sequences and demonstrates their impact through end-to-end 3D reconstruction results. To isolate potential biases, we will add an ablation study in the revision that compares training the diffusion model on the synthetic data versus clean MoCap data, along with 2D reprojection error metrics on held-out captures where available. revision: yes
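The PA-MPJPE metric the rebuttal mentions is MPJPE computed after a similarity (Procrustes) alignment removes global rotation, translation, and scale. A minimal sketch of the standard definition, assuming single-frame (J, 3) joint arrays; this is not the authors' implementation:

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred onto gt.

    pred, gt: (J, 3) joint positions for a single frame, in metres.
    Removes global rotation, translation, and scale before measuring error.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g                 # centre both point sets
    U, S, Vt = np.linalg.svd(G.T @ P)             # cross-covariance matrix
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                            # optimal rotation (row-vector form)
    s = (S * np.diag(D)).sum() / (P ** 2).sum()   # optimal isotropic scale
    aligned = s * P @ R + mu_g
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())

# Hypothetical check: pred is gt rotated 90° about z, doubled, and shifted.
gt = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
Rz = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
pred = 2.0 * gt @ Rz + np.array([3.0, 4.0, 5.0])
err = pa_mpjpe(pred, gt)   # ≈ 0 after alignment
```

Because the alignment discards global pose, PA-MPJPE alone cannot certify the world-space consistency the paper claims; unaligned MPJPE or trajectory error is the stricter test.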
Circularity Check
No significant circularity; two-stage pipeline uses external keypoint extraction and independent synthetic data generation.
Full rationale
The described method extracts 2D keypoints from internet videos (external process), synthesizes multi-view 2D sequences as training data, and trains a separate camera-conditioned diffusion model to lift to 3D. No equations or steps reduce the final 3D output to the input by construction, no fitted parameters are relabeled as predictions, and no self-citation chains or uniqueness theorems are invoked to force the architecture. The pipeline remains open to external validation via held-out multi-view data or ablations, consistent with a non-circular empirical approach.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption 2D keypoints extracted from internet videos can be used to synthesize accurate multi-view 2D motion data for rare motions
Forward citations
Cited by 1 Pith paper
-
SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdo...
Reference graph
Works this paper leans on
- [1] Bharat Lal Bhatnagar, Xianghui Xie, Ilya A. Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision (ECCV), 2016.
- [3] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In International Conference on Computer Vision (ICCV), 2019.
- [4] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [5] Chengan He, Jun Saito, James Zachary, Holly Rushmeier, and Yi Zhou. Nemf: Neural motion fields for kinematic animation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- [7] Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. Intercap: Joint markerless 3d tracking of humans and objects in interaction from multi-view rgb-d images. International Journal of Computer Vision (IJCV), 132(7):2551–2566.
- [8] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [9] Roy Kapon, Guy Tevet, Daniel Cohen-Or, and Amit H. Bermano. Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
- [11] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [12] Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. Pare: Part attention regressor for 3d human body estimation. In International Conference on Computer Vision (ICCV), 2021.
- [13] Muhammed Kocabas, Ye Yuan, Pavlo Molchanov, Yunrong Guo, Michael J. Black, Otmar Hilliges, Jan Kautz, and Umar Iqbal. Pace: Human and camera motion estimation from in-the-wild videos. In International Conference on 3D Vision (3DV).
- [14] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In International Conference on Computer Vision (ICCV), 2019.
- [15] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [16] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [17] Jiaman Li, Jiajun Wu, and C. Karen Liu. Object motion guided human motion synthesis. ACM Transactions on Graphics (TOG), 42(6), 2023.
- [18] Jiaman Li, C. Karen Liu, and Jiajun Wu. Lifting motion to the 3d world via 2d diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [19] Ruilong Li, Shan Yang, David A. Ross, and Angjoo Kanazawa. Ai choreographer: Music conditioned 3d dance generation with aist++. In International Conference on Computer Vision (ICCV), 2021.
- [20] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc Van Gool. Mhformer: Multi-hypothesis transformer for 3d human pose estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [21] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Conference on Computer Vision and Pattern Recognition (CVPR).
- [22] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In International Conference on Computer Vision (ICCV), 2023.
- [23] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. In International Conference on Learning Representations (ICLR), 2024.
- [24] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [25] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015.
- [26] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. Amass: Archive of motion capture as surface shapes. In International Conference on Computer Vision (ICCV), 2019.
- [27] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3d human pose estimation. In International Conference on Computer Vision (ICCV), 2017.
- [28] Tuan Duc Ngo, Peiye Zhuang, Chuang Gan, Evangelos Kalogerakis, Sergey Tulyakov, Hsin-Ying Lee, and Chaoyang Wang. Delta: Dense efficient long-range 3d tracking for any video. In International Conference on Learning Representations (ICLR), 2025.
- [29] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [30] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [31] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In International Conference on Learning Representations (ICLR), 2023.
- [32] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159.
- [33] Xiaolong Shen, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, and Yi Yang. Global-to-local modeling for video-based 3d human pose and shape estimation. In Conference on Computer Vision and Pattern Recognition (CVPR).
- [34] Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In ACM SIGGRAPH Asia Conference Proceedings.
- [35] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
- [36] Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J. Black. Wham: Reconstructing world-grounded humans with accurate 3d motion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [37] Yu Sun, Qian Bao, Wu Liu, Tao Mei, and Michael J. Black. Trace: 5d temporal regression of avatars with dynamic cameras in 3d environments. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [38] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. In International Conference on Learning Representations (ICLR), 2023.
- [39] Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In European Conference on Computer Vision (ECCV), 2024.
- [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- [41] Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, and Hongsheng Li. Encoder-decoder with multi-level attention for 3d human shape and pose estimation. In International Conference on Computer Vision (ICCV), 2021.
- [42] Bastian Wandt, James J. Little, and Helge Rhodin. Elepose: Unsupervised 3d human pose estimation by predicting camera elevation and learning normalizing flows on 2d poses. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [43] Jingbo Wang, Sijie Yan, Yuanjun Xiong, and Dahua Lin. Motion guided 3d pose estimation from videos. In European Conference on Computer Vision (ECCV), 2020.
- [44] Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Chore: Contact, human and object reconstruction from a single rgb image. In European Conference on Computer Vision (ECCV), 2022.
- [45] Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Visibility aware human-object interaction tracking from single rgb camera. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [46] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- [47] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [48] Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [49] Hongwen Zhang, Yating Tian, Yuxiang Zhang, Mengcheng Li, Liang An, Zhenan Sun, and Yebin Liu. Pymaf-x: Towards well-aligned full-body model regression from monocular images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(10):12287–12303, 2023.
- [50] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [51] Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neuraldome: A neural modeling pipeline on multi-view human-object interactions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [52] Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’m hoi: Inertia-aware monocular capture of 3d human-object interactions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [53] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.