Recognition: no theorem link
MoZoo: Unleashing Video Diffusion Power in Animal Fur and Muscle Simulation
Pith reviewed 2026-05-15 06:49 UTC · model grok-4.3
The pith
MoZoo generates high-fidelity animal fur and muscle videos directly from coarse meshes using video diffusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoZoo bypasses conventional refinement steps to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. It employs Role-Aware RoPE to align motion through role-based index remapping while decoupling reference information via fixed temporal offsets. Asymmetric Decoupled Attention partitions the latent sequence to enforce unidirectional information flow, preventing feature interference and improving efficiency. Trained on the MoZoo-Data dataset constructed via a synthetic-to-real pipeline, the model is assessed on the MoZooBench benchmark comprising 120 mesh-video pairs, demonstrating high-fidelity fur simulation with superior temporal and structural consistency across diverse animal skeletons and layouts.
What carries the argument
Role-Aware RoPE (RAR-RoPE) with role-based index remapping and fixed temporal offsets, paired with Asymmetric Decoupled Attention that partitions latents for unidirectional flow.
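The abstract states this mechanism only in words. As an illustration of what role-based index remapping with a fixed reference offset could look like, here is a minimal sketch; the token layout, role names, and offset value are assumptions for illustration, not taken from the paper.

```python
import torch

def role_aware_positions(num_frames: int, tokens_per_frame: int,
                         ref_offset: int = 1024) -> torch.Tensor:
    """Hypothetical sketch of role-based RoPE index remapping.

    Generated-video tokens and mesh-guidance tokens share the same
    temporal indices (so motion stays aligned frame-for-frame), while
    reference-image tokens are pushed out by a fixed offset so they are
    decoupled from the temporal axis. The layout and offset value are
    assumptions; the paper publishes no equations.
    """
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    video_pos = frame_idx            # role: generated video frames
    mesh_pos = frame_idx.clone()     # role: coarse-mesh guidance, aligned 1:1
    ref_pos = torch.full((tokens_per_frame,), ref_offset)  # role: reference image
    return torch.cat([video_pos, mesh_pos, ref_pos])
```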
If this is right
- Enables synthesis of animal videos without conventional refinement steps.
- Achieves superior temporal and structural consistency in fur simulations across diverse skeletons.
- Supports multimodal guidance for controlling motion and appearance.
- Provides MoZooBench as a standardized evaluation set with 120 mesh-video pairs.
- Reduces computational expense compared to traditional production workflows.
Where Pith is reading between the lines
- The attention partitioning technique could transfer to other video generation tasks involving long sequences with reference images (see the mask sketch after this list).
- Similar synthetic-to-real pipelines might address data scarcity in related domains such as cloth or fluid simulation.
- Integration with existing mesh authoring tools could allow animators to preview dynamics without separate physics solvers.
- The approach opens a path toward controllable, high-resolution animal effects in real-time rendering contexts.
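To make the transfer idea concrete, here is a minimal sketch of what a unidirectional (asymmetric) attention mask could look like; the video/condition partition and the exact masking rule are assumptions, since the paper gives no formulation.

```python
import torch
import torch.nn.functional as F

def asymmetric_mask(n_video: int, n_cond: int) -> torch.Tensor:
    """Boolean mask for scaled_dot_product_attention (True = may attend).

    Video tokens attend to everything; condition tokens (mesh guidance,
    reference image) attend only within the condition partition, so
    information flows condition -> video but never back. One plausible
    reading of 'unidirectional information flow', not the paper's own.
    """
    n = n_video + n_cond
    mask = torch.ones(n, n, dtype=torch.bool)
    mask[n_video:, :n_video] = False  # block condition queries from video keys
    return mask

# Hypothetical usage on a joint latent sequence [video ; condition]:
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=asymmetric_mask(nv, nc))
```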
Load-bearing premise
The synthetic-to-real pipeline used to create MoZoo-Data produces training examples whose distribution is close enough to real animal videos that the model generalizes without large domain gaps or artifacts.
What would settle it
Large visual discrepancies, flickering, or loss of structural detail in MoZoo outputs when tested on real captured animal videos outside the training distribution would falsify the generalization claim.
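Flickering in particular is cheap to probe. A crude proxy one could run on held-out real footage (an illustration, not a metric from the paper):

```python
import numpy as np

def flicker_score(video: np.ndarray) -> float:
    """Mean absolute change between consecutive frames.

    video: (T, H, W, C) float array in [0, 1]. A rough proxy for the
    'flickering' failure mode: high values on near-static scenes suggest
    temporal instability. A sanity check, not the paper's evaluation.
    """
    return float(np.abs(np.diff(video, axis=0)).mean())
```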
Original abstract
The creation of cinematic-quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE) which employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh-video pairs. Experimental results demonstrate that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MoZoo, a video diffusion-based generative dynamics solver that synthesizes high-fidelity animal fur and muscle videos from coarse meshes under multimodal guidance. It proposes Role-Aware RoPE (RAR-RoPE) for role-based index remapping to align motion while decoupling references via fixed temporal offsets, and Asymmetric Decoupled Attention to partition latent sequences for unidirectional flow and efficiency. To mitigate data scarcity, it introduces the MoZoo-Data synthetic-to-real pipeline using a rendering engine and inverse mapping, plus the MoZooBench benchmark with 120 mesh-video pairs. The central claim is that experimental results show high-fidelity fur simulation across diverse skeletons with superior temporal and structural consistency.
Significance. If the performance claims are substantiated with quantitative evidence, MoZoo could meaningfully extend video diffusion techniques to physics-informed animal dynamics in computer graphics, offering a potential shortcut past labor-intensive traditional simulation pipelines for cinematic effects. The MoZoo-Data pipeline and MoZooBench benchmark represent concrete resources that could accelerate follow-on work in generative modeling of deformable biological structures. The architectural proposals (RAR-RoPE and asymmetric attention) target specific temporal consistency issues in video generation, which, if shown to be effective, would be of interest to the graphics and generative modeling communities.
Major comments (3)
- [Abstract and Experimental Results] The abstract and experimental claims assert 'high-fidelity fur simulation' and 'superior temporal and structural consistency' yet supply no quantitative metrics, baselines, error bars, ablation studies, or statistical comparisons. This absence is load-bearing because the entire performance argument rests on these unspecified results.
- [MoZoo-Data Pipeline] The MoZoo-Data synthetic-to-real pipeline is described only at a high level (rendering engine plus inverse mapping) with no validation that it closes the domain gap to real animal videos (e.g., no FID, distribution distances, or physics-fidelity metrics on held-out real footage). This directly undermines the generalization claims that rely on the training distribution matching real dynamics such as wind-driven fur or gravity-induced muscle deformation.
- [Method] Role-Aware RoPE (RAR-RoPE) and Asymmetric Decoupled Attention are introduced as novel components, but the manuscript provides no explicit equations, derivations, or analysis showing how the role-based remapping and unidirectional partitioning achieve the claimed alignment and efficiency gains without introducing new free parameters or artifacts (a back-of-envelope cost sketch follows this list).
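On the efficiency claim specifically, a back-of-envelope count shows what blocking one attention quadrant can save; this is illustrative only, as the paper reports no complexity figures.

```python
def attention_scores(n_video: int, n_cond: int) -> tuple[int, int]:
    """Count query-key score evaluations with and without the block.

    Full bidirectional attention over the joint sequence costs
    (n_video + n_cond)^2 scores; blocking the condition->video quadrant
    (one reading of the claimed unidirectional flow) removes
    n_cond * n_video of them.
    """
    full = (n_video + n_cond) ** 2
    asymmetric = full - n_cond * n_video
    return full, asymmetric

# e.g. 16 frames x 1560 latent tokens of video vs 1560 condition tokens:
# full, asym = attention_scores(16 * 1560, 1560)  # ~5.5% fewer scores
```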
Minor comments (2)
- [Title] The title string 'MoZoo:Unleashing' is missing a space after the colon.
- [Abstract] The abstract refers to 'Experimental results demonstrate...' without previewing the specific metrics or comparison methods that will be reported.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We appreciate the recognition of MoZoo's potential contributions to generative modeling in graphics. We agree that the manuscript requires stronger quantitative support, explicit validation of the data pipeline, and detailed mathematical descriptions of the proposed components. All major comments will be addressed through additions and clarifications in the revised version.
Point-by-point responses
- Referee: [Abstract and Experimental Results] The abstract and experimental claims assert 'high-fidelity fur simulation' and 'superior temporal and structural consistency' yet supply no quantitative metrics, baselines, error bars, ablation studies, or statistical comparisons. This absence is load-bearing because the entire performance argument rests on these unspecified results.
  Authors: We acknowledge the need for explicit quantitative evidence to support the performance claims. While the current manuscript emphasizes qualitative results and visual comparisons, the revised version will include a dedicated quantitative evaluation section. This will report metrics such as Fréchet Video Distance (FVD), temporal warping error, and structural similarity measures, with comparisons against baselines including standard video diffusion models and physics-based simulators. Ablation studies on RAR-RoPE and Asymmetric Decoupled Attention will be added, along with error bars from multiple runs and statistical analysis. These changes will be incorporated to substantiate the claims of high-fidelity simulation and superior consistency. Revision: yes.
- Referee: [MoZoo-Data Pipeline] The MoZoo-Data synthetic-to-real pipeline is described only at a high level (rendering engine plus inverse mapping) with no validation that it closes the domain gap to real animal videos (e.g., no FID, distribution distances, or physics-fidelity metrics on held-out real footage). This directly undermines the generalization claims that rely on the training distribution matching real dynamics such as wind-driven fur or gravity-induced muscle deformation.
  Authors: We agree that validation of the synthetic-to-real pipeline is critical. In the revised manuscript, we will expand the MoZoo-Data description with quantitative validation. This includes FID scores and other distributional distances computed between the generated synthetic videos and held-out real animal footage. We will also report physics-fidelity metrics, such as average fur displacement errors and muscle deformation accuracy under simulated wind and gravity conditions, to demonstrate effective closure of the domain gap and support the generalization claims. Revision: yes.
- Referee: [Method] Role-Aware RoPE (RAR-RoPE) and Asymmetric Decoupled Attention are introduced as novel components, but the manuscript provides no explicit equations, derivations, or analysis showing how the role-based remapping and unidirectional partitioning achieve the claimed alignment and efficiency gains without introducing new free parameters or artifacts.
  Authors: We thank the referee for highlighting this omission. The revised manuscript will include explicit mathematical formulations in the Method section. For RAR-RoPE, we will provide the equations for role-based index remapping and the fixed temporal offset decoupling mechanism, along with analysis showing motion alignment without additional free parameters. For Asymmetric Decoupled Attention, we will detail the latent sequence partitioning, unidirectional attention masks, and derivations demonstrating efficiency improvements (e.g., reduced computational complexity) and artifact prevention. Pseudocode and parameter analysis will be added to the main text and appendix. Revision: yes.
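The first two responses both promise Fréchet-style distributional metrics (FVD, FID). Those reduce to the Fréchet distance between two Gaussians fitted to feature embeddings; a minimal sketch of that computation, assuming the usual setup where features come from a pretrained backbone (Inception for FID, a video network for FVD):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    This is the quantity behind FID/FVD once real and generated samples
    are embedded by a feature extractor (the extractor choice is the
    authors', not fixed here).
    """
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # numerical noise can leave a tiny imaginary part
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# mu, sigma come from stacking per-sample features:
# mu = feats.mean(axis=0); sigma = np.cov(feats, rowvar=False)
```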
Circularity Check
No circularity: method components and results are independently specified
Full rationale
The paper introduces RAR-RoPE and Asymmetric Decoupled Attention as architectural proposals, plus a synthetic-to-real data pipeline and MoZooBench benchmark. No equations, fitted parameters, or self-citations are shown that reduce the high-fidelity simulation claims or temporal consistency results to quantities defined by construction from the inputs. Performance is reported via experimental evaluation on held-out mesh-video pairs rather than any self-referential derivation or renaming of known patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- domain assumption Video diffusion models trained on artistic workflows can be repurposed for high-fidelity physical simulation of fur and muscle when given appropriate conditioning and architectural modifications.
Invented entities (2)
- Role-Aware RoPE (RAR-RoPE): no independent evidence
- Asymmetric Decoupled Attention: no independent evidence