pith. sign in

arxiv: 2412.13111 · v2 · pith:UBPKBCUAnew · submitted 2024-12-17 · 💻 cs.CV · cs.GR

Motion-2-To-3: Leveraging 2D Motion Data for 3D Motion Generations

Pith reviewed 2026-05-23 06:34 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords text-driven motion synthesis2D to 3D motion transferhuman motion generationmotion disentanglementvideo-based training3D animation from text
0
0 comments X

The pith

Disentangling local joint motion from global movements lets 2D video data train more diverse text-driven 3D human motion generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a way to use the large supply of 2D human videos as training data for generating 3D motions from text descriptions. Current approaches need expensive 3D motion-capture recordings that limit variety, while videos cover many more activities and styles. The method first trains a generator on text paired with 2D joint motions extracted from video, then adapts it with a small amount of 3D data so the output stays consistent across viewpoints. If successful, this reduces the cost barrier and expands the range of realistic motions available for animation and games.

Core claim

A single-view 2D local motion generator trained on large text-2D motion pairs can be fine-tuned with 3D data to become a multi-view generator that outputs view-consistent local joint motion together with root dynamics, thereby using 2D data to support a wider range of realistic 3D human motion generation from text.

What carries the argument

Disentanglement of local joint motion from global movements, which isolates transferable local priors that can be learned from 2D video before 3D fine-tuning.

Load-bearing premise

That separating local joint motion from global movements allows local priors learned from 2D data to transfer usefully when the model is later fine-tuned on 3D data.

What would settle it

A side-by-side test on the same text prompts showing that the 2D-pretrained model produces lower motion diversity scores or poorer multi-view consistency than an otherwise identical model trained only on 3D data.

Figures

Figures reproduced from arXiv: 2412.13111 by Huaijin Pi, Qing Shuai, Ruizhen Hu, Ruoxi Guo, Sida Peng, Taku Komura, Xiaowei Zhou, Yajiao Dong, Zechen Hu, Zehong Shen, Zhumei Wang.

Figure 1
Figure 1. Figure 1: Illustration of our key idea. (a) Our approach leverages 2D motion data to improve 3D motion generation by unifying 2D and 3D motion data. (b) Our framework yields better FID and generates a broader range of motion types. Abstract Text-driven human motion synthesis is capturing signifi￾cant attention for its ability to effortlessly generate intricate movements from abstract text cues, showcasing its poten￾… view at source ↗
Figure 2
Figure 2. Figure 2: Challenge of 2D motion from the real world. In the real-world videos [30], both the camera and humans move in 3D space, resulting in 2D motion that combines both movements. velocity head to predict the 2D root movement specific to each view. Using 3D data, we could create synthetic multi￾view 2D motion sequences by projecting 3D local motion and root velocities into different views. These multi-view sequen… view at source ↗
Figure 3
Figure 3. Figure 3: Our Pipeline. We design a Multi-view Diffusion model (a) to generate multi-view results (for simplicity, camera embedding is omitted in the figure). During inference, the Multi-view Diffusion model predicts 2D local motion and root velocity (b). Then, we use triangulation [23] to recover 3D local joint positions (c) and accumulate root velocity to obtain 3D global trajectory (d), resulting in the final 3D … view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison. The first two lines are motion out of the HumanML3D [20] dataset. Baseline methods produce incorrect types of motion, while ours are more consistent with the text descriptions. The last row demonstrates that our approach successfully generates standing motion in alignment with the text descriptions, whereas baseline methods fail to produce this correctly. The unnatural poses are hig… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of different strategies using 2D data. Baseline methods [27, 90] fail to generate a motion with a global movement. The variant using 2D condition as input [45] may generate incorrect root movements, leading to the floating motion. The unnatural poses are highlighted in the red boxes. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results of ablation study. Our full model generates more natural motion than the ablations. The unnatural poses are highlighted in the red boxes. The semantics misalignment is highlighted in the dashed boxes. Methods R-Precision ↑ FID↓ MM Dist↓ Diversity→ Real 0.797 0.002 2.974 9.503 w/o pretrain 0.545 4.950 4.717 7.360 w/ CB 0.679 0.642 3.723 9.522 View=3 0.689 0.654 3.607 8.946 View=5 0.691 0… view at source ↗
read the original abstract

Text-driven human motion synthesis has showcased its potential for revolutionizing motion design in the movie and game industry. Existing methods often rely on 3D motion capture data, which requires special setups, resulting in high costs for data acquisition, ultimately limiting the diversity and scope of human motion. In contrast, 2D human videos offer a vast and accessible source of motion data, covering a wider range of styles and activities. In this paper, we explore the use of 2D human motion extracted from videos as an alternative data source to improve text-driven 3D motion generation. Our approach introduces a novel framework that disentangles local joint motion from global movements, enabling efficient learning of local motion priors from 2D data. We first train a single-view 2D local motion generator on a large dataset of text-2D motion pairs. Then we fine-tune the generator with 3D data, transforming it into a multi-view generator that predicts view-consistent local joint motion and root dynamics. Evaluations on the well-acknowledged dataset and novel text prompts demonstrate that our method can efficiently utilize 2D data, supporting a wider range of realistic 3D human motion generation. Our code is publicly available at https://zju3dv.github.io/Motion-2-to-3/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Motion-2-To-3, a framework for text-driven 3D human motion generation that uses abundant 2D video data as an alternative to scarce 3D mocap. It disentangles local joint motion from global movements, pre-trains a single-view 2D local motion generator on large text-2D pairs, then fine-tunes on 3D data to produce a multi-view generator outputting view-consistent local motions and root dynamics. Evaluations on standard datasets and novel prompts are claimed to show efficient 2D data utilization and a wider range of realistic 3D motions; code is released publicly.

Significance. If the transfer of 2D-derived local priors through fine-tuning holds and produces measurable gains in diversity and realism without sacrificing multi-view consistency, the work would meaningfully expand accessible training data for 3D motion synthesis, reducing reliance on expensive mocap setups. The public code release supports reproducibility and is a clear strength.

major comments (2)
  1. [Approach] Approach section (disentanglement step): the central claim that disentangling local joint motion from global movements isolates view-invariant local dynamics from 2D videos (despite projection ambiguities, depth loss, and camera motion) is load-bearing for the efficiency argument, yet no concrete mechanism, loss terms, or validation is described to show that the priors survive fine-tuning without being overwritten or introducing inconsistencies.
  2. [Experiments] Experiments/Evaluation: the abstract states that evaluations 'demonstrate that our method can efficiently utilize 2D data,' but no quantitative ablations, baselines (e.g., 3D-only training), error bars, or metrics isolating the 2D pre-training contribution are visible; without these, the claim that 2D data supports a 'wider range' cannot be assessed and is not yet load-bearing evidence.
minor comments (1)
  1. [Abstract] The abstract refers to 'the well-acknowledged dataset' without naming it (e.g., HumanML3D or AMASS); this should be explicit in the first paragraph of the evaluation section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the manuscript.

read point-by-point responses
  1. Referee: [Approach] Approach section (disentanglement step): the central claim that disentangling local joint motion from global movements isolates view-invariant local dynamics from 2D videos (despite projection ambiguities, depth loss, and camera motion) is load-bearing for the efficiency argument, yet no concrete mechanism, loss terms, or validation is described to show that the priors survive fine-tuning without being overwritten or introducing inconsistencies.

    Authors: We acknowledge that the current description of the disentanglement step is high-level and would benefit from greater specificity. In the revised manuscript we will expand the Approach section to explicitly describe the mechanism: local joint motion is obtained by removing global root translation and orientation from the pose sequence, yielding a representation that is independent of camera viewpoint. We will also detail the loss terms used in the 2D pre-training stage (reconstruction, adversarial, and text-alignment losses) and the fine-tuning stage, together with new validation experiments (e.g., consistency checks across held-out views and ablation of the disentanglement) that demonstrate the local priors are preserved rather than overwritten. These additions will make the load-bearing claim fully transparent. revision: yes

  2. Referee: [Experiments] Experiments/Evaluation: the abstract states that evaluations 'demonstrate that our method can efficiently utilize 2D data,' but no quantitative ablations, baselines (e.g., 3D-only training), error bars, or metrics isolating the 2D pre-training contribution are visible; without these, the claim that 2D data supports a 'wider range' cannot be assessed and is not yet load-bearing evidence.

    Authors: We agree that the experimental evidence for the contribution of 2D pre-training must be strengthened with quantitative ablations. In the revision we will add: (i) a direct 3D-only training baseline, (ii) an ablation that removes the 2D pre-training stage, (iii) standard metrics (FID, diversity, multi-view consistency) reported with error bars from multiple runs, and (iv) results on the novel-prompt set that quantify the increase in motion variety. These new results will isolate the benefit of 2D data and provide load-bearing support for the efficiency claim. revision: yes

Circularity Check

0 steps flagged

No circularity: standard pretrain-then-finetune pipeline with no equations or self-referential reductions.

full rationale

The paper presents a two-stage training procedure (pretrain single-view 2D local motion generator on text-2D pairs, then fine-tune on 3D data to obtain multi-view consistency) after a disentanglement step. No equations, fitted parameters, or predictions are described that reduce by construction to the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim rests on empirical evaluation of the resulting generator rather than any definitional or citation-based loop, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that 2D motion priors learned after local-global disentanglement transfer usefully to 3D via fine-tuning; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Disentangling local joint motion from global movements enables efficient learning of local motion priors from 2D data that transfer to 3D.
    This premise is invoked to justify the two-stage training procedure described in the abstract.

pith-pipeline@v0.9.0 · 5804 in / 1185 out tokens · 24270 ms · 2026-05-23T06:34:32.495114+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Easymocap - make human motion capture easier. Github,

  2. [2]

    Lan- guage2pose: Natural language grounded pose forecasting

    Chaitanya Ahuja and Louis-Philippe Morency. Lan- guage2pose: Natural language grounded pose forecasting. In 3DV, 2019. 2

  3. [3]

    Black, and G¨ul Varol

    Nikos Athanasiou, Mathis Petrovich, Michael J. Black, and G¨ul Varol. SINC: Spatial composition of 3D human motions for simultaneous action generation. ICCV, 2023. 3

  4. [4]

    Make-an-animation: Large-scale text- conditional 3d human motion generation

    Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, and Sonal Gupta. Make-an-animation: Large-scale text- conditional 3d human motion generation. In Proceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 15039–15048, 2023. 2, 3, 6

  5. [5]

    Belfusion: Latent diffusion for behavior-driven human mo- tion prediction

    German Barquero, Sergio Escalera, and Cristina Palmero. Belfusion: Latent diffusion for behavior-driven human mo- tion prediction. In ICCV, 2023. 3

  6. [6]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 8

  7. [7]

    Align your latents: High-resolution video synthesis with la- tent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock- horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with la- tent diffusion models. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023. 3

  8. [8]

    Keep it smpl: Automatic estimation of 3d human pose and shape from a single image

    Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th Euro- pean Conference, Amsterdam, The Netherlands, October 11- 14, 2016, Proceedings, Part V 14, pages 561–578. Springer,

  9. [9]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

  10. [10]

    Gener- ating human motion in 3d scenes from text descriptions

    Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, and Xiaowei Zhou. Gener- ating human motion in 3d scenes from text descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 1855–1866, 2024. 3

  11. [11]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022. 3

  12. [12]

    Fan- tasia3d: Disentangling geometry and appearance for high- quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fan- tasia3d: Disentangling geometry and appearance for high- quality text-to-3d content creation. In Proceedings of the IEEE/CVF international conference on computer vision , pages 22246–22256, 2023. 3

  13. [13]

    Executing your Commands via Motion Diffusion in Latent Space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, and Gang Yu. Executing your Commands via Motion Diffusion in Latent Space. arXiv e-prints, art. arXiv:2212.04048, 2022. 3, 6

  14. [14]

    Mofusion: A framework for denoising-diffusion-based motion synthesis

    Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. Mofusion: A framework for denoising-diffusion-based motion synthesis. In CVPR,

  15. [15]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 3

  16. [16]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024. 3

  17. [17]

    Synthesis of compositional animations from textual descriptions

    Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. Synthesis of compositional animations from textual descriptions. In ICCV, 2021. 2

  18. [18]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19...

  19. [19]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. 3

  20. [20]

    Generating diverse and natural 3d human motions from text

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In CVPR, 2022. 1, 2, 3, 5, 6, 7

  21. [21]

    Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts

    Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal gener- ation of 3d human motions and texts. In ECCV, 2022. 3

  22. [22]

    Momask: Generative masked model- ing of 3d human motions

    Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked model- ing of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1900–1910, 2024. 3

  23. [23]

    Hartley and Peter Sturm

    Richard I. Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, 68(2):146 – 157, 1997. 3, 4, 5

  24. [24]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In NeurIPS, 2020. 3, 4, 6

  25. [25]

    Avatarclip: zero-shot text- driven generation and animation of 3d avatars

    Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. Avatarclip: zero-shot text- driven generation and animation of 3d avatars. ACM Trans. Graph., 41(4), 2022. 3

  26. [26]

    Motiongpt: Human motion as a foreign lan- 9 guage

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign lan- 9 guage. Advances in Neural Information Processing Systems, 36:20067–20079, 2023. 3

  27. [27]

    Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion

    Roy Kapon, Guy Tevet, Daniel Cohen-Or, and Amit H Bermano. Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1965–1974, 2024. 3, 4, 5, 6, 7, 8

  28. [28]

    Guided motion diffusion for controllable human motion synthesis

    Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. In ICCV, 2023. 3

  29. [29]

    Noise-free score distillation

    Oren Katzir, Or Patashnik, Daniel Cohen-Or, and Dani Lischinski. Noise-free score distillation. arXiv preprint arXiv:2310.17590, 2023. 3

  30. [30]

    Emdb: The electromagnetic database of global 3d human pose and shape in the wild

    Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tian- jian Jiang, Chengcheng Tang, Juan Jos ´e Z ´arate, and Otmar Hilliges. Emdb: The electromagnetic database of global 3d human pose and shape in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 14632–14643, 2023. 2

  31. [31]

    Flame: Free- form language-based motion synthesis & editing

    Jihoon Kim, Jiseob Kim, and Sungjoon Choi. Flame: Free- form language-based motion synthesis & editing. In AAAI,

  32. [32]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv e-prints, 2014. 5

  33. [33]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022. 3

  34. [34]

    Priority-centric human motion genera- tion in discrete latent space

    Hanyang Kong, Kehong Gong, Dongze Lian, Michael Bi Mi, and Xinchao Wang. Priority-centric human motion genera- tion in discrete latent space. In ICCV, 2023. 3

  35. [35]

    Eschernet: A genera- tive model for scalable view synthesis

    Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xi- aojuan Qi, and Andrew J Davison. Eschernet: A genera- tive model for scalable view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9503–9513, 2024. 3

  36. [36]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022. 3

  37. [37]

    Mage: Masked generative encoder to unify representation learning and image synthe- sis

    Tianhong Li, Huiwen Chang, Shlok Mishra, Han Zhang, Dina Katabi, and Dilip Krishnan. Mage: Masked generative encoder to unify representation learning and image synthe- sis. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 2142–2152,

  38. [38]

    Omg: Towards open-vocabulary motion generation via mix- ture of controllers

    Han Liang, Jiacheng Bao, Ruichi Zhang, Sihan Ren, Yuecheng Xu, Sibei Yang, Xin Chen, Jingyi Yu, and Lan Xu. Omg: Towards open-vocabulary motion generation via mix- ture of controllers. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 482–493, 2024. 3, 6

  39. [39]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 300–309, 2023. 3

  40. [40]

    Being comes from not-being: Open-vocabulary text-to-motion generation with wordless training

    Junfan Lin, Jianlong Chang, Lingbo Liu, Guanbin Li, Liang Lin, Qi Tian, and Chang-wen Chen. Being comes from not-being: Open-vocabulary text-to-motion generation with wordless training. In CVPR, 2023. 3

  41. [41]

    Motion-x: A large- scale 3d expressive whole-body human motion dataset

    Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large- scale 3d expressive whole-body human motion dataset. Ad- vances in Neural Information Processing Systems, 36, 2024. 2, 5

  42. [42]

    Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models

    Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fi- dler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8576–8588, 2024. 3

  43. [43]

    Plan, posture and go: To- wards open-world text-to-motion generation

    Jinpeng Liu, Wenxun Dai, Chunyu Wang, Yiji Cheng, Yan- song Tang, and Xin Tong. Plan, posture and go: To- wards open-world text-to-motion generation. arXiv preprint arXiv:2312.14828, 2023. 3

  44. [44]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimiza- tion

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimiza- tion. Advances in Neural Information Processing Systems , 36, 2024. 3

  45. [45]

    Zero-1-to- 3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to- 3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 2, 3, 6, 7

  46. [46]

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi- person linear model. ACM Trans. Graph., 2015. 5

  47. [47]

    Amass: Archive of motion capture as surface shapes

    Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Ger- ard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In ICCV, 2019. 2, 3, 5, 6

  48. [48]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Commun. ACM, 2021. 3

  49. [49]

    Openai: Introducing chatgpt

    OpenAI. Openai: Introducing chatgpt. https : / / openai.com/blog/chatgpt, 2022. 3, 5

  50. [50]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023. 3

  51. [51]

    Optitrack

    OptiTrack. Optitrack. https://www.optitrack. com/. 2

  52. [52]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, 2019. 5

  53. [53]

    Black, and G ¨ul Varol

    Mathis Petrovich, Michael J. Black, and G ¨ul Varol. Action- conditioned 3D human motion synthesis with transformer V AE. InICCV, 2021. 3

  54. [54]

    Black, and G ¨ul Varol

    Mathis Petrovich, Michael J. Black, and G ¨ul Varol. TEMOS: Generating diverse human motions from textual descriptions. In ECCV, 2022. 2, 3

  55. [55]

    Hierarchical generation of human-object inter- actions with diffusion probabilistic models

    Huaijin Pi, Sida Peng, Minghui Yang, Xiaowei Zhou, and Hujun Bao. Hierarchical generation of human-object inter- actions with diffusion probabilistic models. In ICCV, 2023. 3 10

  56. [56]

    The kit motion-language dataset

    Matthias Plappert, Christian Mandery, and Tamim Asfour. The kit motion-language dataset. Big data, 2016. 2

  57. [57]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- hall. DreamFusion: Text-to-3D using 2D Diffusion. arXiv e-prints, art. arXiv:2209.14988, 2022. 2, 3

  58. [58]

    Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J

    Abhinanda R. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J. Black. Babel: Bodies, action and behavior with english la- bels. In CVPR, 2021. 2

  59. [59]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, 2021. 3, 4

  60. [60]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. 3

  61. [61]

    Generat- ing diverse high-fidelity images with vq-vae-2

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generat- ing diverse high-fidelity images with vq-vae-2. In NeurIPS,

  62. [62]

    High-resolution image syn- thesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. In CVPR, 2022. 3, 6

  63. [63]

    Human motion diffusion as a generative prior

    Yoni Shafir, Guy Tevet, Roy Kapon, and Amit Haim Bermano. Human motion diffusion as a generative prior. In The Twelfth International Conference on Learning Rep- resentations, 2024. 3

  64. [64]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen- eration. arXiv preprint arXiv:2308.16512, 2023. 2, 3

  65. [65]

    Wham: Reconstructing world-grounded humans with accu- rate 3d motion

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. Wham: Reconstructing world-grounded humans with accu- rate 3d motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 2070– 2080, 2024. 4, 5

  66. [66]

    Text-to-4d dy- namic scene generation

    Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dy- namic scene generation. arXiv preprint arXiv:2301.11280 ,

  67. [67]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for effi- cient 3d content creation. arXiv preprint arXiv:2309.16653,

  68. [68]

    Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior

    Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 22819–22829, 2023. 3

  69. [69]

    Motionclip: Exposing human motion generation to clip space

    Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. Motionclip: Exposing human motion generation to clip space. In ECCV, 2022. 2, 3

  70. [70]

    Human motion diffu- sion model

    Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffu- sion model. In ICLR, 2023. 2, 3, 4, 5, 6, 8

  71. [71]

    Llama: Open and efficient foundation lan- guage models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aure- lien Rodriguez, Armand Joulin, Edouard Grave, and Guil- laume Lample. Llama: Open and efficient foundation lan- guage models. arXiv preprint, 2023. 3

  72. [72]

    Textmesh: Gen- eration of realistic 3d meshes from text prompts

    Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. Textmesh: Gen- eration of realistic 3d meshes from text prompts. In 2024 International Conference on 3D Vision (3DV), pages 1554–

  73. [73]

    Neural discrete representation learning

    Aaron van den Oord, Oriol Vinyals, and koray kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017. 3

  74. [74]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 4, 5

  75. [75]

    Vicon. Vicon. https://www.vicon.com/. 2

  76. [76]

    Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023. 3

  77. [77]

    Holistic-motion2d: Scalable whole-body human motion generation in 2d space

    Yuan Wang, Zhao Wang, Junhao Gong, Di Huang, Tong He, Wanli Ouyang, Jile Jiao, Xuetao Feng, Qi Dou, Shix- iang Tang, et al. Holistic-motion2d: Scalable whole-body human motion generation in 2d space. arXiv preprint arXiv:2406.11253, 2024. 3

  78. [78]

    Humanise: Language-conditioned hu- man motion generation in 3d scenes

    Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Humanise: Language-conditioned hu- man motion generation in 3d scenes. In NeurIPS, 2022. 3

  79. [79]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion. Advances in Neural Information Processing Systems , 36, 2024. 3

  80. [80]

    Omnicontrol: Control any joint at any time for human motion generation

    Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. In The Twelfth International Conference on Learning Representations, 2024. 3

Showing first 80 references.