pith. machine review for the scientific record.

arxiv: 2605.13857 · v1 · submitted 2026-04-08 · 💻 cs.GR · cs.CV · cs.LG

Recognition: no theorem link

MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:49 UTC · model grok-4.3

classification 💻 cs.GR · cs.CV · cs.LG
keywords video diffusion · animal simulation · fur dynamics · muscle simulation · generative modeling · synthetic data · temporal consistency · 3D mesh animation

The pith

MoZoo generates high-fidelity animal fur and muscle videos directly from coarse meshes using video diffusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MoZoo as a generative dynamics solver that produces realistic animal videos with detailed fur and muscle dynamics starting from coarse meshes and additional guidance. Traditional animation of such effects requires extensive labor and computation, making this approach potentially transformative for film and game production. It introduces role-aware positional encoding and decoupled attention to keep motion aligned with the driving mesh while preventing reference appearance from interfering with generation. A custom data pipeline bridges synthetic renders and real videos to produce the necessary training examples, and a new benchmark is provided for evaluation. Results show improved temporal and structural consistency in the generated sequences compared to prior methods.

Core claim

MoZoo bypasses conventional refinement steps to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. It employs Role-Aware RoPE to synchronize motion alignment through role-based index remapping and fixed temporal offsets for decoupling references. Asymmetric Decoupled Attention partitions the latent sequence to enforce unidirectional information flow, preventing interference and boosting efficiency. Trained on the MoZoo-Data dataset constructed via a synthetic-to-real pipeline, the model is assessed on the MoZooBench benchmark comprising 120 mesh-video pairs, demonstrating high-fidelity fur simulation with superior temporal and structural consistency across diverse animal skeletons and layouts.
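
The paper does not spell out the attention rule in equations (the referee report below flags this), but the described behavior admits a simple block-mask reading: reference and mesh tokens stay self-contained while target tokens read from everything. The sketch below is that reading in PyTorch; the [reference | mesh | target] ordering, block sizes, and mask semantics are illustrative assumptions, not the authors' formulation.

```python
import torch

def asymmetric_block_mask(n_ref: int, n_mesh: int, n_tar: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) enforcing one-way flow:
    reference and mesh tokens attend only within their own block, while
    target tokens attend to everything. One possible reading of
    'unidirectional information flow'; the paper's exact rule may differ."""
    n = n_ref + n_mesh + n_tar
    mask = torch.zeros(n, n, dtype=torch.bool)
    ref = slice(0, n_ref)
    mesh = slice(n_ref, n_ref + n_mesh)
    tar = slice(n_ref + n_mesh, n)
    mask[ref, ref] = True          # reference block is self-contained
    mask[mesh, mesh] = True        # mesh block is self-contained
    mask[tar, :] = True            # target reads reference, mesh, and itself
    return mask

# Usage with PyTorch's fused attention; token counts are placeholders.
q = k = v = torch.randn(1, 8, 6 + 6 + 12, 64)   # (batch, heads, tokens, dim)
attn_mask = asymmetric_block_mask(6, 6, 12)      # broadcasts over batch/heads
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

Under this reading, the claimed efficiency gain would come from the zeroed off-diagonal blocks, which a block-sparse attention kernel can skip entirely.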

What carries the argument

Role-Aware RoPE (RAR-RoPE) with role-based index remapping and fixed temporal offsets, paired with Asymmetric Decoupled Attention that partitions latents for unidirectional flow.
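
Read literally, "role-based index remapping" with a "fixed temporal offset" suggests that mesh tokens reuse the target's frame indices while the reference frame is parked at an out-of-range index. A minimal sketch of that interpretation with standard 1D rotary embeddings follows; the offset value, role names, and per-frame granularity are assumptions rather than the paper's stated scheme.

```python
import torch

def rope_frame_indices(n_frames: int, role: str, ref_offset: int = 512) -> torch.Tensor:
    """One reading of role-based remapping: target and mesh tokens share frame
    indices 0..T-1 so their motion is phase-aligned, while the reference frame
    is pushed to a fixed out-of-range offset so it acts as static appearance."""
    if role in ("target", "mesh"):
        return torch.arange(n_frames)
    if role == "reference":
        return torch.full((1,), ref_offset)   # fixed temporal offset, assumed value
    raise ValueError(role)

def apply_rope(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary embedding over the last (even-sized) dimension."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    ang = pos.float()[:, None] * inv_freq[None, :]      # (tokens, d/2)
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Target and mesh frames share indices; the reference lives at a fixed offset.
pos = torch.cat([rope_frame_indices(16, "target"),
                 rope_frame_indices(16, "mesh"),
                 rope_frame_indices(16, "reference")])
q_rot = apply_rope(torch.randn(pos.shape[0], 64), pos)
```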

If this is right

  • Enables synthesis of animal videos without conventional refinement steps.
  • Achieves superior temporal and structural consistency in fur simulations across diverse skeletons.
  • Supports multimodal guidance for controlling motion and appearance.
  • Provides MoZooBench as a standardized evaluation set with 120 mesh-video pairs.
  • Reduces computational expense compared to traditional production workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attention partitioning technique could transfer to other video generation tasks involving long sequences with reference images.
  • Similar synthetic-to-real pipelines might address data scarcity in related domains such as cloth or fluid simulation.
  • Integration with existing mesh authoring tools could allow animators to preview dynamics without separate physics solvers.
  • The approach opens a path toward controllable, high-resolution animal effects in real-time rendering contexts.

Load-bearing premise

The synthetic-to-real pipeline used to create MoZoo-Data produces training examples whose distribution is close enough to real animal videos that the model generalizes without large domain gaps or artifacts.
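
This premise is checkable: fit Gaussians to encoder features of synthetic MoZoo-Data clips and of held-out real footage, then compare them with the Fréchet distance that underlies FID/FVD (the referee report below asks for exactly such numbers). A minimal sketch, assuming features have already been extracted with some pretrained encoder; the feature arrays below are random placeholders.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets
    (rows = samples, columns = feature dims), the quantity behind FID/FVD:
    d^2 = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    covmean = covmean.real                      # discard tiny imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Hypothetical usage: synthetic MoZoo-Data features vs. held-out real footage.
synthetic_feats = np.random.randn(500, 128)     # stand-ins for encoder outputs
real_feats = np.random.randn(500, 128) + 0.1
print(frechet_distance(synthetic_feats, real_feats))
```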

What would settle it

Large visual discrepancies, flickering, or loss of structural detail in MoZoo outputs when tested on real captured animal videos outside the training distribution would falsify the generalization claim.
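
Flickering, in particular, can be measured rather than eyeballed. A common proxy is temporal warping error: warp each frame to the next with optical flow and score the residual. A minimal sketch using OpenCV's Farneback flow as a stand-in estimator; the frame format (BGR uint8 arrays) and the estimator choice are assumptions, not the paper's evaluation protocol.

```python
import numpy as np
import cv2

def warp_with_flow(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp `frame` (H, W, 3) with dense flow (H, W, 2): each output
    pixel is sampled from `frame` at its own location plus the flow vector."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(frame, map_x, map_y, interpolation=cv2.INTER_LINEAR)

def temporal_warping_error(frames: list[np.ndarray]) -> float:
    """Mean squared error between each frame and its flow-warped predecessor.
    Lower values indicate steadier fur and texture across time."""
    errors = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        prev_g = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        nxt_g = cv2.cvtColor(nxt, cv2.COLOR_BGR2GRAY)
        # Flow from nxt to prev, so prev can be backward-warped onto nxt's grid.
        flow = cv2.calcOpticalFlowFarneback(nxt_g, prev_g, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        warped = warp_with_flow(prev, flow)
        errors.append(np.mean((warped.astype(np.float32) - nxt.astype(np.float32)) ** 2))
    return float(np.mean(errors))
```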

Figures

Figures reproduced from arXiv: 2605.13857 by Bin Xia, Dongxia Liu, Jiancheng Zhang, Jie Ma, Jin Li, Jun Liang, Nisha Huang, Wenming Yang, Xiaochen Yang, Zhehan Kan.

Figure 1. Visualization of MoZoo results.
Figure 2. Comparison of production workflows. Traditional pipelines (top) require a sequence of complex simulation stages, including muscle rigging and hair rendering; MoZoo (bottom) streamlines this into a single generative dynamics solving process, synthesized directly from a mesh and a reference image.
Figure 3. Pipeline for constructing MoZoo-Data from synthetic and real-world sources. (a) Synthetic data generation uses Unreal Engine 5 with diverse animal assets, scenes, and camera trajectories. (b) Real-world footage is processed through scene segmentation and first-frame editing, followed by mesh extraction via an inverse generative model and quality filtering with a Vision-Language Model.
Figure 4. Pipeline of the MoZoo framework. (a) RAR-RoPE for target, mesh, and reference tokens across frame, width, and height dimensions. (b) The restricted attention matrix regulates information exchange among the latent components to maintain structural and appearance fidelity during generation.
Figure 5. Qualitative comparison of text-based generation. Given text prompts and mesh-guided motion, the method synthesizes realistic animal textures with high fidelity, preserving intricate high-frequency details that appear over-smoothed in the VACE results.
Figure 6. Qualitative comparison of reference-based generation. The proposed method renders animal subjects with superior spatio-temporal consistency, finer textural detail, and more realistic lighting effects.
Figure 7. Ablation analysis of RAR and ADA in V2V synthesis, and comparison with the I2V modality. Removing these components leads to texture misalignment and loss of fine details; the V2V configuration provides superior identity fidelity and texture consistency compared to the I2V setting.
Figure 8. Results of cross-species texture transfer. MoZoo maps biological textures from a reference animal onto a source mesh proxy of a different species, achieving photorealistic synthesis with anatomical coherence.
Original abstract

The creation of cinematic-quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE) which employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh-video pairs. Experimental results demonstrate that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MoZoo, a video diffusion-based generative dynamics solver that synthesizes high-fidelity animal fur and muscle videos from coarse meshes under multimodal guidance. It proposes Role-Aware RoPE (RAR-RoPE) for role-based index remapping to align motion while decoupling references via fixed temporal offsets, and Asymmetric Decoupled Attention to partition latent sequences for unidirectional flow and efficiency. To mitigate data scarcity, it introduces the MoZoo-Data synthetic-to-real pipeline using a rendering engine and inverse mapping, plus the MoZooBench benchmark with 120 mesh-video pairs. The central claim is that experimental results show high-fidelity fur simulation across diverse skeletons with superior temporal and structural consistency.

Significance. If the performance claims are substantiated with quantitative evidence, MoZoo could meaningfully extend video diffusion techniques to physics-informed animal dynamics in computer graphics, offering a potential shortcut past labor-intensive traditional simulation pipelines for cinematic effects. The MoZoo-Data pipeline and MoZooBench benchmark represent concrete resources that could accelerate follow-on work in generative modeling of deformable biological structures. The architectural proposals (RAR-RoPE and asymmetric attention) target specific temporal consistency issues in video generation, which, if shown to be effective, would be of interest to the graphics and generative modeling communities.

major comments (3)
  1. [Abstract and Experimental Results] The abstract and experimental claims assert 'high-fidelity fur simulation' and 'superior temporal and structural consistency' yet supply no quantitative metrics, baselines, error bars, ablation studies, or statistical comparisons. This absence is load-bearing because the entire performance argument rests on these unspecified results.
  2. [MoZoo-Data Pipeline] The MoZoo-Data synthetic-to-real pipeline is described only at a high level (rendering engine plus inverse mapping) with no validation that it closes the domain gap to real animal videos (e.g., no FID, distribution distances, or physics-fidelity metrics on held-out real footage). This directly undermines the generalization claims that rely on the training distribution matching real dynamics such as wind-driven fur or gravity-induced muscle deformation.
  3. [Method] Role-Aware RoPE (RAR-RoPE) and Asymmetric Decoupled Attention are introduced as novel components, but the manuscript provides no explicit equations, derivations, or analysis showing how the role-based remapping and unidirectional partitioning achieve the claimed alignment and efficiency gains without introducing new free parameters or artifacts.
minor comments (2)
  1. [Title] The title string 'MoZoo:Unleashing' is missing a space after the colon.
  2. [Abstract] The abstract refers to 'Experimental results demonstrate...' without previewing the specific metrics or comparison methods that will be reported.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We appreciate the recognition of MoZoo's potential contributions to generative modeling in graphics. We agree that the manuscript requires stronger quantitative support, explicit validation of the data pipeline, and detailed mathematical descriptions of the proposed components. All major comments will be addressed through additions and clarifications in the revised version.

Point-by-point responses
  1. Referee: [Abstract and Experimental Results] The abstract and experimental claims assert 'high-fidelity fur simulation' and 'superior temporal and structural consistency' yet supply no quantitative metrics, baselines, error bars, ablation studies, or statistical comparisons. This absence is load-bearing because the entire performance argument rests on these unspecified results.

    Authors: We acknowledge the need for explicit quantitative evidence to support the performance claims. While the current manuscript emphasizes qualitative results and visual comparisons, the revised version will include a dedicated quantitative evaluation section. This will report metrics such as Fréchet Video Distance (FVD), temporal warping error, and structural similarity measures, with comparisons against baselines including standard video diffusion models and physics-based simulators. Ablation studies on RAR-RoPE and Asymmetric Decoupled Attention will be added, along with error bars from multiple runs and statistical analysis. These changes will be incorporated to substantiate the claims of high-fidelity simulation and superior consistency. revision: yes

  2. Referee: [MoZoo-Data Pipeline] The MoZoo-Data synthetic-to-real pipeline is described only at a high level (rendering engine plus inverse mapping) with no validation that it closes the domain gap to real animal videos (e.g., no FID, distribution distances, or physics-fidelity metrics on held-out real footage). This directly undermines the generalization claims that rely on the training distribution matching real dynamics such as wind-driven fur or gravity-induced muscle deformation.

    Authors: We agree that validation of the synthetic-to-real pipeline is critical. In the revised manuscript, we will expand the MoZoo-Data description with quantitative validation. This includes FID scores and other distributional distances computed between the generated synthetic videos and held-out real animal footage. We will also report physics-fidelity metrics, such as average fur displacement errors and muscle deformation accuracy under simulated wind and gravity conditions, to demonstrate effective closure of the domain gap and support the generalization claims. revision: yes

  3. Referee: [Method] Role-Aware RoPE (RAR-RoPE) and Asymmetric Decoupled Attention are introduced as novel components, but the manuscript provides no explicit equations, derivations, or analysis showing how the role-based remapping and unidirectional partitioning achieve the claimed alignment and efficiency gains without introducing new free parameters or artifacts.

    Authors: We thank the referee for highlighting this omission. The revised manuscript will include explicit mathematical formulations in the Method section. For RAR-RoPE, we will provide the equations for role-based index remapping and the fixed temporal offset decoupling mechanism, along with analysis showing motion alignment without additional free parameters. For Asymmetric Decoupled Attention, we will detail the latent sequence partitioning, unidirectional attention masks, and derivations demonstrating efficiency improvements (e.g., reduced computational complexity) and artifact prevention. Pseudocode and parameter analysis will be added to the main text and appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: method components and results are independently specified

Full rationale

The paper introduces RAR-RoPE and Asymmetric Decoupled Attention as architectural proposals, plus a synthetic-to-real data pipeline and MoZooBench benchmark. No equations, fitted parameters, or self-citations are shown that reduce the high-fidelity simulation claims or temporal consistency results to quantities defined by construction from the inputs. Performance is reported via experimental evaluation on held-out mesh-video pairs rather than any self-referential derivation or renaming of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the assumption that standard video diffusion architectures can be specialized for physics-like dynamics via the two new attention mechanisms and that the synthetic data pipeline closes the domain gap sufficiently for high-fidelity output.

axioms (1)
  • domain assumption Video diffusion models trained on artistic workflows can be repurposed for high-fidelity physical simulation of fur and muscle when given appropriate conditioning and architectural modifications.
    Stated in the abstract as the premise that diffusion models have shown promise but remain unexploited for animal simulation.
invented entities (2)
  • Role-Aware RoPE (RAR-RoPE) no independent evidence
    purpose: Synchronize motion alignment across animal parts while decoupling reference information via fixed temporal offsets
    New positional encoding variant introduced to handle role-based index remapping.
  • Asymmetric Decoupled Attention no independent evidence
    purpose: Partition the latent sequence to enforce unidirectional information flow and prevent feature interference
    New attention mechanism proposed to improve efficiency and consistency.

pith-pipeline@v0.9.0 · 5538 in / 1318 out tokens · 29633 ms · 2026-05-15T06:49:34.272646+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 16 internal anchors

  1. [1]

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...

  2. [2]

    Chang, D., Hou, J., Bozic, A., Neuberger, A., Juefei-Xu, F., Maury, O., Lin, G.W.C., Stuyck, T., Roble, D., Soleymani, M., Grabli, S.: Hairweaver: Few-shot photorealistic hair motion synthesis with sim-to-real guided video diffusion (2026), https://arxiv.org/abs/2602.11117

  3. [3]

    Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion models (2024)

  4. [4]

Chiang, M.J.Y., Bitterli, B., Tappan, C., Burley, B.: A practical and controllable hair and fur model for production path tracing. In: ACM SIGGRAPH 2015 Talks. SIGGRAPH '15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2775280.2792559

  5. [5]

    Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis (2021), https://arxiv.org/abs/2105.05233

  6. [6]

    Epic Games: Unreal engine 5.5 (2026),https://www.unrealengine.com, version 5.7

  7. [7]

Guo, Q., Yang, T., He, X., Shen, F., Zhang, Y., Kang, Z., Wei, X., Xu, D.: Wildactor: Unconstrained identity-preserving video generation (2026), https://arxiv.org/abs/2603.00586

  8. [8]

Guo, X., Ye, F., Li, X., Tu, P., Zhang, P., Sun, Q., Zhao, S., Hou, X., He, Q.: Dreamid-v: Bridging the image-to-video gap for high-fidelity face swapping via diffusion transformer (2026), https://arxiv.org/abs/2601.01425

  9. [9]

    HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., Panet, P., Weissbuch, S., Kulikov, V., Bitterman, Y., Melumian, Z., Bibi, O.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)

  10. [10]

He, X., Liu, Q., Qian, S., Wang, X., Hu, T., Cao, K., Yan, K., Zhang, J.: ID-Animator: Zero-shot identity-preserving human video generation (2024), https://arxiv.org/abs/2404.15275

  11. [11]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models (2020), https://arxiv.org/abs/2006.11239

  12. [12]

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models (2022),https://arxiv.org/abs/2204.03458

  13. [13]

    Huang, Y., Ruan, P., Zi, B., Qi, X., Wang, J., Xiao, R.: Refaçade: Editing object with given reference texture (2025),https://arxiv.org/abs/2512.04534

  14. [14]

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)

  15. [15]

Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

  16. [16]

    Ju, X., Ye, W., Liu, Q., Wang, Q., Wang, X., Wan, P., Zhang, D., Gai, K., Xu, Q.: Fulldit: Multi-task video generative foundation model with full attention (2025), https://arxiv.org/abs/2503.19907

  17. [17]

    Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer (2021),https://arxiv.org/abs/2108.05997

  18. [18]

Kingma, D.P., Welling, M.: Auto-encoding variational bayes (2022), https://arxiv.org/abs/1312.6114

  19. [19]

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  20. [20]

    Labs, B.F.: FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2 (2025)

  21. [21]

    Li, L., Wang, G., Zhang, Z., Li, Y., Li, X., Dou, Q., Gu, J., Xue, T., Shan, Y.: Tooncomposer: Streamlining cartoon production with generative post-keyframing (2025),https://arxiv.org/abs/2508.10881

  22. [22]

    Li, Y., Xia, M., Liu, G., Bai, J., Wang, X., Zhang, C., Lin, Y., Chu, R., Wan, P., Yang, Y.: Adaviewplanner: Adapting video diffusion models for viewpoint planning in 4d scenes (2025),https://arxiv.org/abs/2510.10670

  23. [23]

Lin, W., Li, H., Zhu, Y.: Controlhair: Physically-based video diffusion for controllable dynamic hair rendering (2025), https://arxiv.org/abs/2509.21541

  24. [24]

    Liu, Q., Gao, B., Huang, W., Zhang, J., Sun, Z., Wei, Y., Liu, F., Peng, Z., Ma, Q., Yang, S., Liao, Z., Zhao, H., Niu, L.: Animatescene: Camera-controllable animation in any scene (2026),https://arxiv.org/abs/2508.05982

  25. [25]

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow (2022),https://arxiv.org/abs/2209.03003

  26. [26]

    Luo, H., Ouyang, M., Zhao, Z., Jiang, S., Zhang, L., Zhang, Q., Yang, W., Xu, L., Yu, J.: Gaussianhair: Hair modeling and rendering with light-aware gaussians (2024),https://arxiv.org/abs/2402.10483

  27. [27]

Marschner, S.R., Jensen, H.W., Cammarano, M., Worley, S., Hanrahan, P.: Light scattering from human hair fibers. ACM Trans. Graph. 22(3), 780–791 (Jul 2003). https://doi.org/10.1145/882262.882345

  28. [28]

Moon, J.T., Walter, B., Marschner, S.: Efficient multiple scattering in hair using spherical harmonics. In: ACM SIGGRAPH 2008 Papers. SIGGRAPH '08, Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1399504.1360630

  29. [29]

Peebles, W., Xie, S.: Scalable diffusion models with transformers (2023), https://arxiv.org/abs/2212.09748

  30. [30]

Pexels: Free stock photos & videos. https://www.pexels.com/ (2026), accessed: 2026-01-22

  31. [31]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021), https://arxiv.org/abs/2103.00020

  32. [32]

Rosu, R.A., Saito, S., Wang, Z., Wu, C., Behnke, S., Nam, G.: Neural strands: Learning hair geometry and appearance from multi-view images (2022), https://arxiv.org/abs/2207.14067

  33. [33]

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., Jitsev, J.: Laion-5b: An open large-scale dataset for training next generation image-text models (2022), https://arxiv.org/abs/...

  34. [34]

Shi, Y., Liu, Y., Wu, Y., Liu, X., Zhao, C., Luo, J., Zhou, B.: Drive any mesh: 4d latent diffusion for mesh deformation from video (2025), https://arxiv.org/abs/2506.07489

  35. [35]

    SideFX: Houdini.https://www.sidefx.com/(2026), accessed: 2026-03-02

  36. [36]

Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-a-video: Text-to-video generation without text-video data (2022), https://arxiv.org/abs/2209.14792

  37. [37]

Sklyarova, V., Chelishev, J., Dogaru, A., Medvedev, I., Lempitsky, V., Zakharov, E.: Neural haircut: Prior-guided strand-based hair reconstruction (2023), https://arxiv.org/abs/2306.05872

  38. [38]

Sklyarova, V., Kabadayi, B., Yiannakidis, A., Becherini, G., Black, M.J., Thies, J.: Neuralfur: Animal fur reconstruction from multi-view images (2026), https://arxiv.org/abs/2601.12481

  39. [39]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models (2022), https://arxiv.org/abs/2010.02502

  40. [40]

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer (2025),https://arxiv.org/abs/2411.15098

  41. [41]

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., ...: Wan: Open and advanced large-scale video generative models

  42. [42]

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

  43. [43]

    Wang, Z., Nam, G., Stuyck, T., Lombardi, S., Cao, C., Saragih, J., Zollhoefer, M., Hodgins, J., Lassner, C.: Neuwigs: A neural dynamic model for volumetric hair capture and animation (2023),https://arxiv.org/abs/2212.00613

  44. [44]

    Wu, K., Yang, L., Kuang, Z., Feng, Y., Han, X., Shen, Y., Fu, H., Zhou, K., Zheng, Y.: Monohair: High-fidelity hair modeling from a monocular video (2024), https://arxiv.org/abs/2403.18356

  45. [45]

Xia, Z., Wang, Y., Lu, Z., Liu, K., Xiao, J., Wonka, P.: OMEGA-avatar: One-shot modeling of 360-degree gaussian avatars (2026), https://arxiv.org/abs/2602.11693

  46. [46]

Xing, J., Xia, M., Liu, Y., Zhang, Y., Zhang, Y., He, Y., Liu, H., Chen, H., Cun, X., Wang, X., Shan, Y., Wong, T.T.: Make-your-video: Customized video generation using textual and structural guidance (2023), https://arxiv.org/abs/2306.00943

  47. [47]

    Xue, B., Duan, Z.P., Yan, Q., Wang, W., Liu, H., Guo, C.L., Li, C., Li, C., Lyu, J.: Stand-in: A lightweight and plug-and-play identity control for video generation (2025),https://arxiv.org/abs/2508.07901

  48. [48]

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

  49. [49]

Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models (2023), https://arxiv.org/abs/2308.06721

  50. [50]

Yuan, S., Huang, J., He, X., Ge, Y., Shi, Y., Chen, L., Luo, J., Yuan, L.: Identity-preserving text-to-video generation by frequency decomposition (2025), https://arxiv.org/abs/2411.17440

  51. [51]

Zakharov, E., Sklyarova, V., Black, M., Nam, G., Thies, J.: Human hair reconstruction with strand-aligned 3d gaussians (2024), https://arxiv.org/abs/2409.14778

  52. [52]

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models (2023),https://arxiv.org/abs/2302.05543

  53. [53]

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  54. [54]

Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all (2024), https://arxiv.org/abs/2412.20404

  55. [55]

Zhou, Y., Chai, M., Wang, D., Winberg, S., Wood, E., Sarkar, K., Gross, M., Beeler, T.: Groomcap: High-fidelity prior-free hair capture. ACM Transactions on Graphics 43(6), 1–15 (Nov 2024). https://doi.org/10.1145/3687768

  56. [56]

Zinke, A., Yuksel, C., Weber, A., Keyser, J.: Dual scattering approximation for fast multiple scattering in hair. ACM Trans. Graph. 27(3), 1–10 (Aug 2008). https://doi.org/10.1145/1360612.1360631