arxiv: 2605.15497 · v2 · pith:25765X4Cnew · submitted 2026-05-15 · 💻 cs.CV · cs.GR

AnyAct: Towards Human Reenactment of Character Motion From Video

Pith reviewed 2026-05-20 20:04 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords human motion reenactment · character video to human · sparse 2D motion cues · motion retargeting · non-human character animation · video-based motion generation · conditional motion synthesis

The pith

Sparse local 2D motion cues from character videos can generate plausible human reenactments without 3D source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that monocular videos of non-human characters can be turned into initial human motion sequences by conditioning on sparse local articulated cues rather than full body reconstructions. This matters for animation workflows because it removes the usual requirements for known character topologies or structured 3D inputs, letting animators start from ordinary video of animals, cartoons, or abstract figures. The approach relies on three practical designs: supervising the model only with human motion data via 2D projections, training progressively to reduce ambiguity, and separating global and local motion signals for better control. If successful, the result is an editable human performance that keeps the timing and style of the original character motion.

Core claim

AnyAct formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. It achieves this through human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to reduce conditioning ambiguity, and global-local motion decoupling for reliable local control, producing high-fidelity reenactments on a new benchmark of diverse non-human character videos.
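A minimal sketch of how human-motion-only supervision through augmented 3D-to-2D projection could be set up, assuming a pinhole camera with random yaw and pitch as the augmentation. The function names, angle ranges, and camera parameters are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def project_to_2d(joints_3d, yaw, pitch=0.0, focal=1.0, depth=4.0):
    """Rotate 3D joints by a camera yaw/pitch, then apply a pinhole projection.

    All parameters are hypothetical; depth must exceed the skeleton's extent
    so that every joint stays in front of the camera.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    points_2d = []
    for x, y, z in joints_3d:
        # Rotate about the vertical (y) axis.
        xr = cy * x + sy * z
        zr = -sy * x + cy * z
        # Rotate about the horizontal (x) axis.
        yr = cp * y - sp * zr
        zr = sp * y + cp * zr
        # Move the skeleton in front of the camera and divide by depth.
        zc = zr + depth
        points_2d.append((focal * xr / zc, focal * yr / zc))
    return points_2d


def augmented_pair(joints_3d, rng):
    """Return (2D condition, 3D target) for one randomly sampled viewpoint."""
    yaw = rng.uniform(-math.pi, math.pi)   # assumed augmentation range
    pitch = rng.uniform(-0.3, 0.3)         # assumed augmentation range
    return project_to_2d(joints_3d, yaw, pitch), joints_3d
```

Because the 2D condition is derived from the 3D target itself, every human motion clip yields a paired training example without any non-human data, which is the property the core claim depends on.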

What carries the argument

Sparse local 2D articulated motion cues extracted from the character video, used as the conditioning signal for generating human motion sequences.
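One way to picture such a cue, assuming it is encoded as a 2D bone angle plus its angular velocity (an encoding chosen here for illustration; the paper's exact feature definition is not given on this page). The helper names and the `fps` default are hypothetical.

```python
import math


def bone_angles(frames, parent, child):
    """Angle of the parent->child bone in each frame.

    frames: list of frames, each a list of (x, y) tracked points.
    """
    return [
        math.atan2(f[child][1] - f[parent][1], f[child][0] - f[parent][0])
        for f in frames
    ]


def angular_velocity(angles, fps=30.0):
    """Finite-difference angular velocity, with differences wrapped to [-pi, pi)."""
    velocities = []
    for a0, a1 in zip(angles, angles[1:]):
        delta = (a1 - a0 + math.pi) % (2 * math.pi) - math.pi
        velocities.append(delta * fps)
    return velocities
```

Both quantities are unchanged when the whole figure is translated in the image, which is one reason a cue of this kind could carry over between bodies with very different layouts.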

If this is right

  • Initial human reenactments become possible directly from monocular character videos without requiring 3D source data or known topologies.
  • The generated motions preserve essential dynamics of the reference character while remaining editable for further animation work.
  • A benchmark of diverse non-human videos can be used to measure how well local cues survive large structural differences.
  • Global-local decoupling allows separate control over the global root trajectory and fine-grained limb actions in the output human sequence.
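The decoupling in the last point can be illustrated with a root-relative split of 2D trajectories. This is a sketch of the general idea only, assuming a single designated root point; the `decouple` helper and its `root` index are hypothetical, and the paper's actual mechanism may differ.

```python
def decouple(frames, root=0):
    """Split 2D joint trajectories into a global root path and root-relative motion.

    frames: list of frames, each a list of (x, y) points.
    root: index of the point treated as the body root (an assumption).
    """
    global_traj = [frame[root] for frame in frames]
    local = [
        [(x - rx, y - ry) for x, y in frame]
        for frame, (rx, ry) in zip(frames, global_traj)
    ]
    return global_traj, local
```

With this split, a noisy or implausible root path from the source video can be discarded or replaced while the local limb motion is kept as the conditioning signal.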

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If local cues alone suffice, then many existing motion retargeting pipelines that rely on full skeletal correspondence may be over-specified for the initial transfer step.
  • Progressive 3D-to-2D training could be tested on other video-to-motion domains such as sign language or dance to see whether the ambiguity reduction generalizes.
  • The separation of global and local signals suggests a natural next step of feeding the same local cues into physics-based simulators for contact-rich actions.

Load-bearing premise

Training solely on human motion data with projected 2D cues and staged learning will transfer to arbitrary non-human character shapes without losing essential timing or adding unnatural artifacts.
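The staged learning mentioned here can be pictured as a curriculum that begins with unambiguous 3D conditioning and gradually shifts to 2D conditioning. The schedule below is purely an assumption for illustration: the linear ramp, the step counts, and the function names do not come from the paper.

```python
import random


def p_use_2d(step, warmup_steps, ramp_steps):
    """Probability of conditioning on 2D cues at a given training step."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)


def pick_condition(step, rng, warmup_steps=1000, ramp_steps=4000):
    """Sample which conditioning signal to use for this training step."""
    use_2d = rng.random() < p_use_2d(step, warmup_steps, ramp_steps)
    return "2d" if use_2d else "3d"
```

Under this reading, the premise is that whatever the model learns during the easier 3D phase still constrains its outputs once only 2D cues remain.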

What would settle it

Generate human motions from a video of an octopus or highly asymmetric creature and check whether the output human sequence still matches the original action timings and energy profile or instead collapses into generic human walking.
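A rough way to run that check, assuming the "energy profile" is approximated by the per-frame sum of squared joint displacements and that timing agreement is scored with a Pearson correlation. Both choices, and the helper names, are assumptions made for illustration rather than metrics from the paper.

```python
import math


def energy_profile(frames):
    """Sum of squared per-point displacements between consecutive frames."""
    profile = []
    for prev, cur in zip(frames, frames[1:]):
        profile.append(
            sum((cx - px) ** 2 + (cy - py) ** 2
                for (px, py), (cx, cy) in zip(prev, cur))
        )
    return profile


def pearson(a, b):
    """Pearson correlation of two profiles; 0.0 if either one is constant."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

A correlation near 1 between source and output profiles would indicate preserved timing, while an output that has collapsed into steady generic walking would show a flat profile and a low score against a bursty source.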

Figures

Figures reproduced from arXiv: 2605.15497 by Jiewei Wang, Kanglin Liu, Leidong Fan, Lei Zhong, Liuhan Chen, Li Yuan, Qing Li, Qin Shuai.

Figure 1: Human reenactment from reference videos. Given monocular videos of non-human characters with diverse topologies, AnyAct reinterprets their characteristic motion patterns as plausible human performances rather than reproducing their source structures literally. Shown here are reenactments of (a) kangaroo-like jumping, (b) butterfly-like wing flapping, and (c) the periodic paw motion of a beckoning cat. We s…
Figure 2: Although the motions of a non-human character (e.g., a jumping kangaroo) and its human reenactment may differ substantially in morphology and topology, their sparse local articulated movements still carry essential and similar dynamic tendencies. This suggests that local sparse motion patterns provide a more stable bridge between monocular character video observations and human reenactment than source-…
Figure 3: Given reference videos of characters, AnyAct first extracts local sparse 2D joint trajectories as transferable motion cues from the input video using our model-ensemble-based Versatile Feature Extractor (VFE). These cues are then injected into a human motion generator (MoMask++) through the ControlNet-like 2D Local Adapter (2D-LA) to produce the initial human reenactments that follow the observed character…
Figure 4: Motion Condition Learning. We learn reliable motion control for our AnyAct using only human motion data. This is achieved by our proposed augmented 3D-to-2D projection for providing paired supervision, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for suppressing unreliable global root motion.
Figure 5: Result Gallery. (1) dancing with side-to-side swaying, following the rhythm of the ghost, (2) deer-like walking, (3) penguin-like walking, (4) monkey-like walking, (5) dinosaur-like walking, (6) seal-like walking (with side-to-side swaying), (7) mechanical-spider-like in-place jumping, (8) toy-robot-like walking (with side-to-side swaying).
Figure 6: Qualitative comparison of AnyAct against VLM+HY-Motion and EchoMotion. Based on the reference videos, the human should perform: (1) monster-like flying, (2) cartoon bear-like jumping, (3) penguin-like walking, and (4) rabbit-like bounding. The results demonstrate that our method achieves superior reenactment quality compared to the other two baselines, while preserving the plausibility of the motion.
Figure 7: Result of the user study. We report the preference rates of our AnyAct in pairwise comparisons against (a) VLM+HY-Motion and (b) EchoMotion. Participants evaluated the generated motions based on Reenactment Similarity, Motion Quality, and Overall Preference, respectively. Our method consistently outperforms both baselines across all criteria.
Figure 8: Trajectory control and intuitive editing. (1) cat-like walking with forward and half-circle trajectory control, (2) editing the human's height when performing kangaroo-like jumping, (3) editing the human's arm spread when performing penguin-like walking.
Figure 9: Generalization to the reenactment of human characters. Although our AnyAct does not aim to achieve the same level of absolute reconstruction as human-centric mocap-based methods, it still demonstrates the ability to perform reliable reenactment of human characters from monocular videos.
Figure 10: Results of DancingBox. Reference videos are obtained from the DancingBox project page, where the source captures and results are concatenated together. Our results (in the right part) confirm that AnyAct can effectively recover the essential dynamics of physical proxies as DancingBox does.
Figure 11: Examples of adapting Seedance 2.0 for human reenactment. Although the leading closed-source video generation model, Seedance 2.0, possesses generalized world knowledge, it still struggles to produce consistent human reenactment videos driven by non-human motion references. Moreover, 3D motions directly reconstructed from monocular generated videos suffer from inferior quality and artifacts. Thus, such a n…
Figure 12: Limitation. Left: AnyAct struggles to generate plausible reenactments for motions far outside the training distribution, such as frog-like swimming involving rapid leg kicks and a rare prone posture. Right: Actions like crab-like sideways walking with rapid multi-leg movements and pixel-level similarities with neighboring points can cause CoTracker3 tracking failures, resulting in noisy 2D features that i…
Original abstract

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that sparse local articulated 2D motion cues extracted from monocular non-human character videos can serve as a topology-agnostic bridge to generate plausible, editable human reenactments. AnyAct achieves this via human-motion-only supervision through augmented 3D-to-2D projection, progressive training to reduce conditioning ambiguity, and global-local motion decoupling; experiments on a new benchmark of diverse character videos reportedly yield high-fidelity results that preserve essential dynamics.

Significance. If the central generalization claim holds, the work would offer a practical advance for animation authoring pipelines by removing the need for 3D source reconstructions or known topologies, potentially enabling direct video-to-human-motion transfer for arbitrary characters. The emphasis on local 2D cues and progressive training is a concrete technical contribution worth testing.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (method overview): the central claim that 'sparse local articulated motion cues can preserve essential dynamics across large structural differences' rests on human-motion-only supervision via 3D-to-2D projection, yet no explicit mechanism is described for resolving ambiguous limb correspondences (e.g., mapping a quadruped leg to a human arm or leg) or for penalizing loss of 3D depth/self-occlusion information discarded by the 2D projection; this directly affects whether the approach generalizes beyond human-like topologies.
  2. [§4] §4 (experiments) and benchmark description: positive benchmark results and ablation studies are reported, but without quantitative tables, error bars, or per-character topology breakdowns (e.g., results on quadrupeds vs. bipeds with non-rigid parts), it is impossible to verify whether the progressive schedule truly prevents plausible-but-incorrect human motions for out-of-distribution characters or merely interpolates within the human distribution.
minor comments (2)
  1. [§3.1] Notation for 'transferable sparse local 2D articulated motion' is introduced without a clear formal definition or a diagram showing how the 2D keypoints are extracted and fed to the human motion generator as conditioning.
  2. [§4.1] The benchmark construction paragraph should include explicit statistics on character diversity (number of topologies, video lengths, motion complexity) to allow readers to assess coverage of the claimed 'large structural differences'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our method and results without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method overview): the central claim that 'sparse local articulated motion cues can preserve essential dynamics across large structural differences' rests on human-motion-only supervision via 3D-to-2D projection, yet no explicit mechanism is described for resolving ambiguous limb correspondences (e.g., mapping a quadruped leg to a human arm or leg) or for penalizing loss of 3D depth/self-occlusion information discarded by the 2D projection; this directly affects whether the approach generalizes beyond human-like topologies.

    Authors: We agree that an explicit limb correspondence mechanism is not described because our design intentionally avoids it: the sparse local 2D articulated cues are topology-agnostic by construction, encoding only local joint velocities and angles that transfer across structures without requiring global alignment or predefined mappings. Human-motion-only supervision via augmented 3D-to-2D projection trains the model to produce plausible outputs conditioned on these cues, while progressive training gradually increases conditioning complexity to reduce ambiguity. Global-local decoupling further isolates local dynamics from global pose, allowing the model to focus on transferable motion patterns. We acknowledge that the manuscript could better articulate how depth and self-occlusion information loss is mitigated implicitly through learned human priors rather than explicit penalties. We will revise §3 to include a dedicated paragraph explaining these aspects and their relation to generalization. revision: yes

  2. Referee: [§4] §4 (experiments) and benchmark description: positive benchmark results and ablation studies are reported, but without quantitative tables, error bars, or per-character topology breakdowns (e.g., results on quadrupeds vs. bipeds with non-rigid parts), it is impossible to verify whether the progressive schedule truly prevents plausible-but-incorrect human motions for out-of-distribution characters or merely interpolates within the human distribution.

    Authors: We thank the referee for highlighting this presentation issue. The manuscript reports quantitative metrics and ablation studies in §4, but we recognize that the current tables lack error bars and explicit per-topology breakdowns. To address this, we will expand the experimental section with a new table providing mean and standard deviation across multiple runs, plus a breakdown of results grouped by character topology (quadrupeds, bipeds with non-rigid appendages, etc.). This will allow readers to assess whether the progressive schedule supports generalization beyond interpolation within the human motion distribution. revision: yes
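The per-topology table promised in the second response could be assembled along these lines. This is a sketch under assumptions: the record layout `(topology, run_id, score)` and the `summarize` helper are hypothetical, and no scores from the paper are used.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(records):
    """Group scores by character topology and report (mean, sample std).

    records: iterable of (topology, run_id, score) tuples. The layout is an
    assumption made for this sketch. A group with a single score gets std 0.0.
    """
    groups = defaultdict(list)
    for topology, _run_id, score in records:
        groups[topology].append(score)
    return {
        topology: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for topology, scores in groups.items()
    }
```

Reporting the spread next to the mean for each group would show whether a gain on, say, quadrupeds is larger than the run-to-run variation, which is the comparison the referee asks for.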

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external human data and standard projections

full rationale

The paper's central claim rests on the observation that sparse local articulated motion cues preserve dynamics across structural differences, then implements AnyAct via human-motion-only supervision with augmented 3D-to-2D projection and progressive training. These components draw from external human motion datasets and conventional projection methods rather than reducing outputs to the model's own fitted parameters or self-citations by construction. No equations equate predictions to inputs tautologically, and the approach is benchmarked against diverse non-human videos without evident self-referential fitting. This yields a low circularity score consistent with self-contained use of independent external supervision.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; no explicit free parameters, axioms, or invented entities are named.

pith-pipeline@v0.9.0 · 5786 in / 957 out tokens · 44989 ms · 2026-05-20T20:04:05.652056+00:00 · methodology

Reference graph

Works this paper leans on

158 extracted references · 158 canonical work pages · 10 internal anchors

  1. [1]

    ACM Transactions on Graphics (ToG) , volume=

    Deepphase: Periodic autoencoders for learning motion phase manifolds , author=. ACM Transactions on Graphics (ToG) , volume=. 2022 , publisher=

  2. [2]

    ACM Transactions On Graphics (TOG) , volume=

    Unpaired motion style transfer from video to animation , author=. ACM Transactions On Graphics (TOG) , volume=. 2020 , publisher=

  3. [3]

    Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games , pages=

    Video-Based Motion Retargeting Framework between Characters with Various Skeleton Structure , author=. Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games , pages=

  4. [4]

    2019 International conference on 3D vision (3DV) , pages=

    Language2pose: Natural language grounded pose forecasting , author=. 2019 International conference on 3D vision (3DV) , pages=. 2019 , organization=

  5. [5]

    Proceedings of the 5th ACM International Conference on Multimedia in Asia , pages=

    Cross-modal retrieval for motion and text via droptriple loss , author=. Proceedings of the 5th ACM International Conference on Multimedia in Asia , pages=

  6. [6]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Generating diverse and natural 3d human motions from text , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  7. [7]

    European Conference on Computer Vision , pages=

    Motionclip: Exposing human motion generation to clip space , author=. European Conference on Computer Vision , pages=. 2022 , organization=

  8. [8]

    European conference on computer vision , pages=

    Temos: Generating diverse human motions from textual descriptions , author=. European conference on computer vision , pages=. 2022 , organization=

  9. [9]

    The Eleventh International Conference on Learning Representations , year=

    Human Motion Diffusion Model , author=. The Eleventh International Conference on Learning Representations , year=

  10. [10]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Executing your commands via motion diffusion in latent space , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  11. [11]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Motiondiffuse: Text-driven human motion generation with diffusion model , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2024 , publisher=

  12. [12]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Remodiffuse: Retrieval-augmented motion diffusion model , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  13. [13]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Rethinking diffusion for text-driven human motion generation: Redundant representations, evaluation, and masked autoregression , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  14. [14]

    arXiv preprint arXiv:2510.26794 , year=

    The quest for generalizable motion generation: Data, model, and evaluation , author=. arXiv preprint arXiv:2510.26794 , year=

  15. [15]

    arXiv preprint arXiv:2512.23464 , year=

    HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation , author=. arXiv preprint arXiv:2512.23464 , year=

  16. [16]

    arXiv preprint arXiv:2603.15546 , year=

    Kimodo: Scaling Controllable Human Motion Generation , author=. arXiv preprint arXiv:2603.15546 , year=

  17. [17]

    Advances in Neural Information Processing Systems , volume=

    Motiongpt: Human motion as a foreign language , author=. Advances in Neural Information Processing Systems , volume=

  18. [18]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Generating human motion from textual descriptions with discrete representations , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  19. [19]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Attt2m: Text-driven human motion generation with multi-perspective attention mechanism , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  20. [20]

    Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

    Motiongpt: Human motion synthesis with improved diversity and realism via gpt-3 prompting , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

  21. [21]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Go to zero: Towards zero-shot motion generation with million-scale data , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  22. [22]

    European Conference on Computer Vision , pages=

    Bamm: bidirectional autoregressive motion model , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  23. [23]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Momask: Generative masked modeling of 3d human motions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  24. [24]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    Snapmogen: Human motion generation from expressive texts , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  25. [25]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mmm: Generative masked motion model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  26. [26]

    ACM Transactions on Graphics (TOG) , volume=

    Motion puzzle: Arbitrary motion style transfer by body part , author=. ACM Transactions on Graphics (TOG) , volume=. 2022 , publisher=

  27. [27]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  28. [28]

    SIGGRAPH Asia 2024 Conference Papers , pages=

    Monkey see, monkey do: Harnessing self-attention in motion diffusion for zero-shot motion transfer , author=. SIGGRAPH Asia 2024 Conference Papers , pages=

  29. [29]

    European Conference on Computer Vision , pages=

    SMooDi: Stylized Motion Diffusion Model , author=. European Conference on Computer Vision , pages=

  30. [30]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Arbitrary motion style transfer with multi-condition motion latent diffusion model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  31. [31]

    arXiv preprint arXiv:2509.04058 , year=

    Smoogpt: Stylized motion generation using large language models , author=. arXiv preprint arXiv:2509.04058 , year=

  32. [32]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  33. [33]

    ACM Transactions on Graphics (TOG) , volume=

    Listen, denoise, action! audio-driven motion synthesis with diffusion models , author=. ACM Transactions on Graphics (TOG) , volume=. 2023 , publisher=

  34. [34]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Diffusion-based generation, optimization, and planning in 3d scenes , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  35. [35]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Move as you say interact as you can: Language-guided human motion generation with scene affordance , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  36. [36]

    arXiv preprint arXiv:2506.00173 , year=

    MotionPersona: Characteristics-aware Locomotion Control , author=. arXiv preprint arXiv:2506.00173 , year=

  37. [37]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Personabooth: Personalized text-to-motion generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  38. [38]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Auto-regressive diffusion for generating 3d human-object interactions , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  39. [39]

    Omnicontrol: Control any joint at any time for human motion generation,

    Omnicontrol: Control any joint at any time for human motion generation , author=. arXiv preprint arXiv:2310.08580 , year=

  40. [40]

    ACM SIGGRAPH 2024 conference papers , pages=

    Flexible motion in-betweening with diffusion models , author=. ACM SIGGRAPH 2024 conference papers , pages=

  41. [41]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Maskcontrol: Spatio-temporal control for masked motion synthesis , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  42. [42]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Pomp: Physics-consistent motion generative model through phase manifolds , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  43. [43]

    ACM Transactions on Graphics (TOG) , year=

    Adaptnet: Policy adaptation for physics-based character control , author=. ACM Transactions on Graphics (TOG) , year=

  44. [44]

    ACM Transactions on Graphics (TOG) , year=

    Sketch2anim: Towards transferring sketch storyboards into 3d animation , author=. ACM Transactions on Graphics (TOG) , year=

  45. [45]

    ACM Transactions on Graphics (TOG) , volume=

    Physics-based character controllers using conditional vaes , author=. ACM Transactions on Graphics (TOG) , volume=. 2022 , publisher=

  46. [46]

    ACM Transactions on Graphics (ToG) , volume=

    Amp: Adversarial motion priors for stylized physics-based character control , author=. ACM Transactions on Graphics (ToG) , volume=. 2021 , publisher=

  47. [47]

    ACM Transactions On Graphics (TOG) , volume=

    Maskedmimic: Unified physics-based character control through masked motion inpainting , author=. ACM Transactions On Graphics (TOG) , volume=. 2024 , publisher=

  48. [48]

    ACM SIGGRAPH 2024 Conference Papers , pages=

    Strategy and skill learning for physics-based table tennis animation , author=. ACM SIGGRAPH 2024 Conference Papers , pages=

  49. [49]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Motion-2-to-3: Leveraging 2D Motion Data for 3D Motion Generations , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  50. [50]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Humandreamer: Generating controllable human-motion videos via decoupled generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  51. [51]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mas: Multi-view ancestral sampling for 3d motion generation using 2d diffusion , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  52. [52]

    SIGGRAPH Asia 2024 Conference Papers , pages=

    Motionfix: Text-driven 3d human motion editing , author=. SIGGRAPH Asia 2024 Conference Papers , pages=

  53. [53]

    ACM SIGGRAPH 2024 Conference Papers , pages=

    Iterative motion editing with natural language , author=. ACM SIGGRAPH 2024 Conference Papers , pages=

  54. [54]

    Advances in Neural Information Processing Systems , volume=

    Finemogen: Fine-grained spatio-temporal motion generation and editing , author=. Advances in Neural Information Processing Systems , volume=

  55. [55]

    arXiv preprint arXiv:2512.24200 , year=

    PartMotionEdit: Fine-Grained Text-Driven 3D Human Motion Editing via Part-Level Modulation , author=. arXiv preprint arXiv:2512.24200 , year=

  56. [56]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Flame: Free-form language-based motion synthesis & editing , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  57. [57]

    MotionLab: Unified human motion generation and editing via the motion-condition-motion paradigm. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  58. [58]

    MotionCLR: Motion generation and training-free editing via understanding attention mechanisms. OpenReview.

  59. [59]

    OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions. arXiv preprint arXiv:2512.19159.

  60. [60]

    AMASS: Archive of motion capture as surface shapes. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  61. [61]

    BABEL: Bodies, action and behavior with English labels. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  62. [62]

    Motion-X: A large-scale 3D expressive whole-body human motion dataset. Advances in Neural Information Processing Systems.

  63. [63]

    CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos. arXiv preprint arXiv:2601.10632.

  64. [64]

    EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer. arXiv preprint arXiv:2512.18814.

  65. [65]

    Generating Human Motion Videos using a Cascaded Text-to-Video Framework. arXiv preprint arXiv:2510.03909.

  66. [66]

    AnyTop: Character animation diffusion with any topology. Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers.

  67. [67]

    X-MoGen: Unified Motion Generation across Humans and Animals. arXiv preprint arXiv:2508.05162.

  68. [68]

    PhysCap: Physically plausible monocular 3D motion capture in real time. ACM Transactions on Graphics (TOG), 2020.

  69. [69]

    MotioNet: 3D human motion reconstruction from monocular video with skeleton consistency. ACM Transactions on Graphics (TOG), 2020.

  70. [70]

    A locality-based neural solver for optical motion capture. SIGGRAPH Asia 2023 Conference Papers.

  71. [71]

    RoMo: A Robust Solver for Full-body Unlabeled Optical Motion Capture. SIGGRAPH Asia 2024 Conference Papers.

  72. [72]

    StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data. Proceedings of the SIGGRAPH Asia 2025 Conference Papers.

  73. [73]

    DancingBox: A Lightweight MoCap System for Character Animation from Physical Proxies. arXiv preprint arXiv:2603.17704.

  74. [74]

    EgoMDM: Diffusion-based Human Motion Synthesis from Sparse Egocentric Sensors. Thirteenth International Conference on 3D Vision.

  75. [75]

    EasyMoCap - Make human motion capture easier. 2021.

  76. [76]

    Novel View Synthesis of Human Interactions from Sparse Multi-view Videos. SIGGRAPH Conference Proceedings.

  77. [77]

    Efficient Neural Radiance Fields for Interactive Free-viewpoint Video. SIGGRAPH Asia Conference Proceedings.

  78. [78]

    Realtime multi-person 2D pose estimation using part affinity fields. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  79. [79]

    Hand keypoint detection in single images using multiview bootstrapping. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  80. [80]

    OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Showing first 80 references.