Recognition: unknown
AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
Pith reviewed 2026-05-07 10:45 UTC · model grok-4.3
The pith
Enlarging the training set to 300K identities and redesigning the variational autoencoder lets a feed-forward model turn text prompts into coherent animated 3D meshes in seconds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AnimateAnyMesh++ is a feed-forward framework that, after expanding DyMesh-XL to 300K identities mined from Objaverse-XL, redesigning DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal enhanced features, and adding variable-length training to both the VAE and the rectified-flow generator, produces semantically accurate and temporally coherent mesh animations from text prompts within seconds.
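The claim implies a two-stage inference path: a text-conditioned rectified-flow generator samples a motion latent, and the DyMeshVAE-Flex decoder maps that latent to per-frame vertex positions for the input mesh. The sketch below is a minimal, assumption-laden rendering of that path; all module names, shapes, and dimensions are illustrative stand-ins, not the paper's released interfaces.

```python
# Minimal sketch of the claimed inference path: text prompt -> rectified-flow
# sampling in a motion latent space -> decoding to per-frame vertex positions.
# All names and sizes below are illustrative stand-ins, not the released code.
import torch

T_FRAMES, N_VERTS, D_LAT = 16, 1024, 256   # assumed sequence length, vertex count, latent size

class ToyVelocityField(torch.nn.Module):
    """Stand-in for the text-conditioned rectified-flow generator."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(D_LAT + 1 + D_LAT, D_LAT)  # latent + time + text embedding

    def forward(self, z, t, text_emb):
        t_col = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_col, text_emb], dim=-1))

class ToyMotionDecoder(torch.nn.Module):
    """Stand-in for the DyMeshVAE-Flex decoder: latent -> per-frame vertex offsets."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(D_LAT, T_FRAMES * N_VERTS * 3)

    def forward(self, z, rest_verts):
        offsets = self.net(z).view(-1, T_FRAMES, N_VERTS, 3)
        return rest_verts[None, None] + offsets        # animated vertex trajectory

@torch.no_grad()
def animate(rest_verts, text_emb, n_steps=20):
    v_field, decoder = ToyVelocityField(), ToyMotionDecoder()
    z = torch.randn(1, D_LAT)                          # start from noise at t = 0
    for k in range(n_steps):                           # Euler integration of dz/dt = v(z, t, text)
        t = torch.tensor([[k / n_steps]])
        z = z + v_field(z, t, text_emb) / n_steps
    return decoder(z, rest_verts)                      # (1, T, N, 3) vertex positions

traj = animate(torch.zeros(N_VERTS, 3), torch.randn(1, D_LAT))
print(traj.shape)  # torch.Size([1, 16, 1024, 3])
```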
What carries the argument
DyMeshVAE-Flex, the variational autoencoder that encodes and reconstructs variable-length mesh trajectories, upgraded with power-law topology-aware attention and vertex-normal enhanced features.
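The paper's exact attention formula is not reproduced here, so the following is only one plausible reading of "power-law topology-aware attention": an additive logit bias of -γ·log(1 + d), where d is the hop distance on the mesh connectivity graph, so attention weight decays as (1 + d)^(-γ) with topological distance. Everything in this sketch is an assumption, not the authors' implementation.

```python
# A hedged guess at "power-law topology-aware attention": attention logits get an
# additive bias of -gamma * log(1 + hop_distance), i.e. weights decay as a power
# law in graph distance. This is an assumed form, not the paper's formulation.
import numpy as np
from collections import deque

def hop_distances(n_verts, edges):
    """All-pairs hop distance on the mesh connectivity graph (BFS from each vertex)."""
    adj = [[] for _ in range(n_verts)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.full((n_verts, n_verts), np.inf)
    for s in range(n_verts):
        dist[s, s], q = 0, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s, v] == np.inf:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return dist

def topology_aware_attention(q, k, v, hop_dist, gamma=1.0):
    """Scaled dot-product attention with a power-law topology bias (assumed form)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) - gamma * np.log1p(hop_dist)   # unreachable pairs get -inf
    logits -= logits.max(axis=-1, keepdims=True)                 # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Toy example: 4 vertices on a path 0-1-2-3, 8-dimensional per-vertex features.
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
dist = hop_distances(4, [(0, 1), (1, 2), (2, 3)])
print(topology_aware_attention(q, k, v, dist).shape)  # (4, 8)
```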
Load-bearing premise
That scaling the dynamic dataset and the new attention and feature choices in the autoencoder produce real gains in trajectory accuracy and artifact reduction on meshes never seen during training.
What would settle it
A side-by-side comparison on a held-out collection of in-the-wild meshes with novel motions, measuring whether animation coherence improves and artifact counts decrease relative to the previous version of the model.
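For illustration only, such a comparison could score each model's output with a trajectory error (mean per-vertex L2 distance to a reference motion, when one exists) and a temporal-coherence proxy (mean frame-to-frame acceleration magnitude). Neither metric is claimed to be the paper's official protocol.

```python
# Illustrative comparison on held-out meshes; the metrics are generic proxies,
# not the paper's benchmark definitions.
import numpy as np

def trajectory_error(pred, ref):
    """Mean per-vertex L2 distance; pred/ref have shape (T, N, 3)."""
    return float(np.linalg.norm(pred - ref, axis=-1).mean())

def jerkiness(traj):
    """Mean second-difference magnitude; lower suggests smoother, more coherent motion."""
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]
    return float(np.linalg.norm(accel, axis=-1).mean())

rng = np.random.default_rng(1)
ref = rng.standard_normal((16, 500, 3))                 # stand-in reference motion
old_model = ref + 0.05 * rng.standard_normal(ref.shape)  # noisier baseline output
new_model = ref + 0.02 * rng.standard_normal(ref.shape)  # cleaner upgraded output
for name, pred in [("previous", old_model), ("upgraded", new_model)]:
    print(name, trajectory_error(pred, ref), jerkiness(pred))
```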
Original abstract
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. We present AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes with substantial upgrades in data, architecture, and generative capability. First, we expand the DyMesh-XL dataset by mining dynamic content from Objaverse-XL, increasing the number of unique identities from 60K to 300K and substantially broadening category and motion diversity. Second, we redesign DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal enhanced features, which significantly improves trajectory reconstruction, local geometry preservation, and mitigates trajectory-sticking artifacts. Third, we introduce architectural changes to both DyMeshVAE-Flex and the rectified-flow (RF) generator to support variable-length sequence training and generation, enabling longer animations while preserving reconstruction fidelity. Extensive experiments demonstrate that AnimateAnyMesh++ generates semantically accurate and temporally coherent mesh animations within seconds, surpassing prior approaches in quality and efficiency. The enlarged DyMesh-XL, the upgraded DyMeshVAE-Flex, and variable-length RF together deliver consistent gains across benchmarks and in-the-wild meshes. We will release code, models, and the expanded DyMesh-XL upon acceptance of this manuscript to facilitate research in 4D content creation.
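On the abstract's variable-length claim, a standard way to train one attention model on mixed-length frame sequences is to pad to a common length and mask the padded positions. The sketch below shows that generic recipe; it should not be read as the paper's confirmed mechanism.

```python
# Generic padding-plus-masking recipe for variable-length frame sequences;
# an assumption about how variable-length training could be supported, not
# the paper's verified implementation.
import torch

def pad_and_mask(seqs):
    """seqs: list of (T_i, D) frame-feature tensors -> padded (B, T_max, D) plus bool mask."""
    t_max = max(s.shape[0] for s in seqs)
    batch = torch.zeros(len(seqs), t_max, seqs[0].shape[1])
    pad_mask = torch.ones(len(seqs), t_max, dtype=torch.bool)   # True marks a padded slot
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        pad_mask[i, : s.shape[0]] = False
    return batch, pad_mask

attn = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
seqs = [torch.randn(t, 32) for t in (8, 12, 16)]      # three clips of different length
x, pad_mask = pad_and_mask(seqs)
out, _ = attn(x, x, x, key_padding_mask=pad_mask)     # padded frames are ignored as keys
print(out.shape)  # torch.Size([3, 16, 32])
```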
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes. It expands DyMesh-XL to 300K identities by mining dynamic content from Objaverse-XL, redesigns DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal features, and adds architectural support for variable-length sequences in both the VAE and rectified-flow generator. Extensive experiments are reported to show that the model produces semantically accurate and temporally coherent animations in seconds, outperforming prior methods in quality and efficiency with consistent gains on benchmarks and in-the-wild meshes.
Significance. If the quantitative benchmarks and ablations hold, the work advances 4D foundation models by scaling data diversity and improving architectural flexibility for spatio-temporal modeling. The combination of enlarged training data, topology-aware attention, and variable-length training addresses data scarcity and artifact issues in mesh animation. The planned public release of code, models, and the expanded DyMesh-XL dataset strengthens the contribution by enabling reproducibility and follow-on research.
Minor comments (3)
- Abstract: the summary of results would be more informative if it included one or two key quantitative metrics (e.g., a specific improvement in trajectory error or user-study score) alongside the qualitative claims of superiority.
- The description of power-law topology-aware attention would benefit from an explicit equation or pseudocode block to clarify how the attention weights are computed from the topology.
- Ensure that all experimental tables report standard deviations or error bars across multiple runs to support the claim of consistent gains.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were raised in the report, so we have no point-by-point rebuttals to provide at this stage. We will incorporate any minor suggestions during revision and ensure the planned release of code, models, and the expanded DyMesh-XL dataset proceeds upon acceptance.
Circularity Check
No significant circularity identified
Full rationale
The paper describes empirical upgrades to dataset size (DyMesh-XL from 60K to 300K identities), architecture (power-law topology-aware attention and vertex-normal features in DyMeshVAE-Flex), and training (variable-length rectified flow support). Performance claims rest on quantitative benchmarks, ablations, and in-the-wild comparisons rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce outputs to inputs by construction. The work is self-contained against external benchmarks.