Feed-forward Motion In-betweening for Any 4D
Pith reviewed 2026-06-26 12:19 UTC · model grok-4.3
The pith
A feed-forward framework generates arbitrary 4D mesh sequences by conditioning a rectified flow model on sparse keyframes via universal latents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that universal mesh-animation latents, produced by a frame-wise mesh VAE that outputs topology-agnostic tokens anchored to a reference mesh, enable a keyframe-conditioned rectified flow model with MMDiT backbone to synthesize non-keyframe frames for arbitrary 4D meshes in a single feed-forward step, yielding strong performance and improved controllability on DyMesh16 and DyMesh32 benchmarks.
What carries the argument
Frame-wise mesh VAE producing topology-agnostic latent tokens anchored by a reference mesh, paired with a keyframe-conditioned rectified flow model using MMDiT backbone.
If this is right
- Long-horizon 4D sequences can be produced without error accumulation from short-horizon generation.
- Users gain direct spatiotemporal control by editing only the supplied keyframes.
- Inference becomes fast enough for practical use in animation and games without per-sequence optimization.
- The same latent space supports meshes of varying shapes and connectivities.
Where Pith is reading between the lines
- The same latent construction might transfer to other dynamic 3D representations such as point clouds or neural fields if the anchoring step generalizes.
- Real-time interactive 4D editing tools could become feasible if the flow model supports incremental keyframe updates.
- World-modeling pipelines that currently rely on video diffusion priors could switch to this feed-forward route for faster rollout.
Load-bearing premise
Universal mesh-animation latents exist and a frame-wise VAE can reliably encode arbitrary topologies into consistent tokens anchored by a reference mesh for conditioning.
What would settle it
Run the model on DyMesh32 test sequences where the generated in-between frames diverge from ground truth by more than baseline methods on standard geometric and temporal metrics, or where controllability metrics show no gain over prior feed-forward generators.
Figures
read the original abstract
4D dynamics (3D geometry evolving over time) is a fundamental representation of the physical world and plays a crucial role in world modeling (e.g., animation and games). Owing to the scarcity of large-scale, long-horizon 4D mesh data with arbitrary shapes, early text-to-4D methods rely on distillation or test-time optimization from video diffusion priors, making inference prohibitively slow. Recent feed-forward generators greatly reduce inference cost but offer limited spatiotemporal controllability, and short-horizon generation often leads to error accumulation in long-horizon sequences. We propose a novel feed-forward in-betweening framework for arbitrary 4D meshes with keyframe conditioning. Building on universal mesh-animation latents, we introduce a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh for keyframe conditioning. We further introduce a keyframe-conditioned rectified flow model with an MMDiT backbone that synthesizes non-keyframe frames conditioned on sparse keyframes. Experiments show strong performance and improved controllability on both DyMesh16 and DyMesh32 benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a feed-forward motion in-betweening framework for arbitrary 4D meshes with keyframe conditioning. It builds on universal mesh-animation latents via a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh. A keyframe-conditioned rectified flow model with an MMDiT backbone then synthesizes the non-keyframe frames. The central claim is that this yields strong performance and improved controllability on the DyMesh16 and DyMesh32 benchmarks.
Significance. If the performance claims hold under quantitative scrutiny, the work would advance feed-forward 4D mesh generation by providing controllable long-horizon synthesis without distillation or test-time optimization, addressing a key limitation in current text-to-4D and animation pipelines.
major comments (1)
- [Abstract] Abstract: The assertion that 'Experiments show strong performance and improved controllability' is load-bearing for the paper's contribution yet supplies no quantitative metrics, error bars, ablation tables, or baseline implementation details, preventing verification of the claim.
minor comments (2)
- [Abstract] Abstract: The term 'MMDiT backbone' is introduced without expansion or reference, reducing accessibility for readers outside the immediate subfield.
- [Abstract] Abstract: The DyMesh16 and DyMesh32 benchmarks are named without any characterization of their content, sequence length, or topology diversity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address the point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that 'Experiments show strong performance and improved controllability' is load-bearing for the paper's contribution yet supplies no quantitative metrics, error bars, ablation tables, or baseline implementation details, preventing verification of the claim.
Authors: We agree that the abstract would benefit from concrete quantitative support for the performance claims. While the full manuscript already contains detailed results with metrics, error bars, ablation tables, and baseline comparisons in Section 4 and the supplementary material, the abstract itself is currently high-level. In the revision we will update the abstract to include key quantitative results from the DyMesh16 and DyMesh32 benchmarks (e.g., primary error metrics and relative improvements), thereby making the central claim more verifiable without exceeding typical abstract length constraints. revision: yes
Circularity Check
No significant circularity
full rationale
The provided abstract and description contain no equations, derivations, or loss formulations. The method is explicitly described as building on external 'universal mesh-animation latents' rather than defining success metrics or latents in terms of its own outputs. Experiments are reported on independent DyMesh16/DyMesh32 benchmarks. No self-citation chains, fitted inputs renamed as predictions, or self-definitional reductions are present in the given text, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Universal mesh-animation latents exist that allow topology-agnostic encoding of arbitrary 4D meshes.
invented entities (2)
-
frame-wise mesh VAE
no independent evidence
-
keyframe-conditioned rectified flow model with MMDiT backbone
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Bae, H., Oh, G., Lee, K., Kim, S., Shin, H., Lee, K.M.: Less is more: Improving motion diffusion models with sparse keyframes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7902–7913 (October 2025)
2025
-
[2]
In: Computer Vision – ECCV 2024
Bahmani, S., Liu, X., Yifan, W., Skorokhodov, I., Rong, V., Liu, Z., Liu, X., Park, J.J., Tulyakov, S., Wetzstein, G., Tagliasacchi, A., Lindell, D.B.: TC4D: Trajectory-conditioned text-to-4d generation. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15104, pp. 53–72. Springer (2024).https: //doi.org/10.1007/978-3-031-72952-2_4
-
[3]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4d-fy: Text-to-4d gener- ation using hybrid score distillation sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7996– 8006 (June 2024), https://openaccess.thecvf...
2024
-
[5]
Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., VanderBilt, E., Kembhavi, A., Vondrick, C., Gkioxari, G., Ehsani, K., Schmidt, L., Farhadi, A.: Objaverse-xl: A universe of 10m+ 3d objects (2023).https://doi.org/10.48550/arXiv.2307.05663, https: //arxiv.org/abs/2307.05663
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.05663 2023
-
[6]
InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects (2022). https://doi.org/10.48550/arXiv.2212.08051 , https://arxiv.org/abs/2212.08051
-
[7]
arXiv preprint arXiv:2403.03206 (2024)
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)
Pith/arXiv arXiv 2024
-
[8]
Geng, Z., Han, C., Hayder, Z., Liu, J., Shah, M., Mian, A.: Text-guided 3d human motion generation with keyframe-based parallel skip transformer (2024).https: //doi.org/10.48550/arXiv.2405.15439,https://arxiv.org/abs/2405.15439
-
[9]
In: ACM SIGGRAPH Asia Conference Papers (2025)
Goel, P., Tevet, G., Liu, C.K., Fatahalian, K.: Generating detailed character mo- tion from blocking poses. In: ACM SIGGRAPH Asia Conference Papers (2025). https://doi.org/10.1145/3757377.3763874 , https://dl.acm.org/doi/10. 1145/3757377.3763874
-
[10]
Computer Graphics Forum (2025).https://doi.org/ 10.1111/cgf.70060
Goel, P., Zhang, H., Liu, C.K., Fatahalian, K.: Generative motion infilling from imprecisely timed keyframes. Computer Graphics Forum (2025).https://doi.org/ 10.1111/cgf.70060
-
[11]
00063,https://arxiv.org/abs/2312.00063
Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions (2023).https://doi.org/10.48550/arXiv.2312. 00063,https://arxiv.org/abs/2312.00063
-
[12]
ACM Transactions on Graphics39(4) (July 2020).https://doi.org/10.1145/ 3386569.3392480
Harvey, F.G., Yurick, M., Nowrouzezahrai, D., Pal, C.: Robust motion in-betweening. ACM Transactions on Graphics39(4) (July 2020).https://doi.org/10.1145/ 3386569.3392480
arXiv 2020
-
[13]
Flexible diffusion model- ing of long videos.arXiv preprint arXiv:2205.11495, 2022
Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F.: Flexible diffusion modeling of long videos. In: Advances in Neural Information Processing Systems (2022). https://doi.org/10.48550/arXiv.2205.11495, https://arxiv.org/abs/ 2205.11495
-
[14]
Hong, S., Kim, H., Cho, K., Noh, J.: Long-term motion in-betweening via keyframe prediction. Computer Graphics Forum (2024).https://doi.org/10.1111/cgf. 15171
work page doi:10.1111/cgf 2024
-
[15]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21807–21818 (June 2024),https://o...
2024
-
[16]
Jiang, Y., Yu, C., Cao, C., Wang, F., Hu, W., Gao, J.: Animate3d: Animating any 3d model with multi-view video diffusion (2024).https://doi.org/10.48550/ arXiv.2406.11216,https://arxiv.org/abs/2406.11216 28 H. Nishizawa et al
arXiv 2024
-
[17]
In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum? id=sPUrdFGepF
Jiang, Y., Zhang, L., Gao, J., Hu, W., Yao, Y.: Consistent4d: Consistent 360° dynamic object generation from monocular video. In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum? id=sPUrdFGepF
2024
-
[18]
48550/arXiv.2305.12577,https://arxiv.org/abs/2305.12577
Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis (2023).https://doi.org/10. 48550/arXiv.2305.12577,https://arxiv.org/abs/2305.12577
arXiv 2023
-
[19]
In: International Conference on 3D Vision (3DV)
Kaufmann, M., Aksan, E., Song, J., Pece, F., Ziegler, R., Hilliges, O.: Convolutional autoencoders for human motion infilling. In: International Conference on 3D Vision (3DV). pp. 918–927 (2020).https://doi.org/10.48550/arXiv.2010.11531
-
[20]
Kim, J., Byun, T., Shin, S., Won, J., Choi, S.: Conditional motion in-betweening. Pattern Recognition132, 108894 (December 2022).https://doi.org/10.1016/j. patcog.2022.108894
work page doi:10.1016/j 2022
-
[21]
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015),https://arxiv.org/abs/ 1412.6980
Pith/arXiv arXiv 2015
-
[22]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Li, Y., Takehara, H., Taketomi, T., Zheng, B., Nießner, M.: 4dcomplete: Non-rigid motion estimation beyond the observable surface. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12706–12716 (2021), https://openaccess.thecvf.com/content/ICCV2021/html/Li_4DComplete_Non- Rigid_Motion_Estimation_Beyond_the_Observable_S...
2021
-
[23]
arXiv preprint arXiv:2512.14284 (2025)
Li, Z., Zhang, M., Wu, T., Tan, J., Wang, J., Lin, D.: Ss4d: Native 4d generative model via structured spacetime latents. arXiv preprint arXiv:2512.14284 (2025)
arXiv 2025
-
[24]
In: Advances in Neural Information Processing Systems (2024)
Li, Z., Chen, Y., Liu, P.: Dreammesh4d: Video-to-4d generation with sparse- controlled gaussian-mesh hybrid representation. In: Advances in Neural Information Processing Systems (2024)
2024
-
[25]
arXiv preprint arXiv:2312.13763 (2023)
Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text- to-4d with dynamic 3d gaussians and composed diffusion models. arXiv preprint arXiv:2312.13763 (2023)
arXiv 2023
-
[26]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text- to-4d with dynamic 3d gaussians and composed diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8576–8588 (June 2024),https://openaccess.thecvf.com/content/CVPR2024/ html/Ling_Align_Your_Gaussians_Text- to-...
2024
-
[27]
arXiv preprint arXiv:2209.03003 (2022)
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
Pith/arXiv arXiv 2022
-
[28]
Choreographing a world of dynamic objects, 2026
Lyu, Y., Geng, C., Dharmarajan, K., Zhang, Y., Alzayer, H., Wu, S., Wu, J.: Choreographing a world of dynamic objects. arXiv preprint arXiv:2601.04194 (2026). https://doi.org/10.48550/arXiv.2601.04194 , https://arxiv.org/abs/2601. 04194
-
[29]
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes (2019).https://doi.org/10.48550/ arXiv.1904.03278,https://arxiv.org/abs/1904.03278
Pith/arXiv arXiv 2019
-
[30]
arXiv preprint arXiv:2504.08366 (2025)
Nag, S., Cohen-Or, D., Zhang, H., Mahdavi-Amiri, A.: In-2-4d: Inbetweening from two single-view images to 4d generation. arXiv preprint arXiv:2504.08366 (2025). https://doi.org/10.48550/arXiv.2504.08366
-
[31]
Oreshkin, B.N., Valkanas, A., Harvey, F.G., Ménard, L.S., Bocquelet, F., Coates, M.J.: Motion inbetweening via deepδ-interpolator. arXiv preprint arXiv:2201.06701 (2022).https://doi.org/10.48550/arXiv.2201.06701 Feed-forward Motion In-betweening for Any 4D 29
-
[32]
arXiv preprint arXiv:2401.08742 (2024)
Pan, Z., Yang, Z., Zhu, X., Zhang, L.: Fast dynamic 3d object generation from a single-view video. arXiv preprint arXiv:2401.08742 (2024)
arXiv 2024
-
[33]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195– 4205 (October 2023)
2023
-
[34]
arXiv preprint arXiv:2504.09413 (2025).https://doi.org/10.48550/ arXiv.2504.09413
Qin, J.: Scalable motion in-betweening via diffusion and physics-based character adaptation. arXiv preprint arXiv:2504.09413 (2025).https://doi.org/10.48550/ arXiv.2504.09413
arXiv 2025
-
[35]
Computer Graphics Forum (2024).https://doi.org/10.1111/cgf.15260
Qin, J., Yan, P., An, B.: Robust diffusion-based motion in-betweening. Computer Graphics Forum (2024).https://doi.org/10.1111/cgf.15260
-
[36]
ACM Transactions on Graphics41(6) (December 2022)
Qin, J., Zheng, Y., Zhou, K.: Motion in-betweening via two-stage transformers. ACM Transactions on Graphics41(6) (December 2022). https://doi.org/10. 1145/3550454.3555454
arXiv 2022
-
[37]
In: Meila, M., Zhang, T
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Resea...
2021
-
[38]
DreamGaussian4D: Generative 4D gaussian splatting.arXiv preprint arXiv:2312.17142, 2023
Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023).https: //doi.org/10.48550/arXiv.2312.17142,https://arxiv.org/abs/2312.17142
-
[39]
In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https : / / proceedings
Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Tor- ralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruc- tion model. In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https : / / proceedings . neurips . cc / paper _ files / paper / 2024 / hash / 6808f2c57d9564a2639a4710e3bbd9b9-Abst...
2024
-
[40]
In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J
Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4D dynamic scene generation. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine L...
2023
-
[41]
In: Proceedings of the 40th International Conference on Machine Learn- ing
Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4d dynamic scene generation. In: Proceedings of the 40th International Conference on Machine Learn- ing. Proceedings of Machine Learning Research, vol. 202, pp. 31915–31929. PMLR (Jul 2023),https://proceedings.m...
2023
-
[42]
48550/arXiv.2402.05054,https://arxiv.org/abs/2402.05054
Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3d generation (2024).https://doi.org/10. 48550/arXiv.2402.05054,https://arxiv.org/abs/2402.05054
arXiv 2024
-
[43]
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model (2022).https://doi.org/10.48550/arXiv.2209.14916 , https://arxiv.org/abs/2209.14916
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.14916 2022
-
[44]
In: International Conference on Learning Representations (ICLR) (2025),https://openreview.net/forum?id=ykD8a9gJvy
Wang, X., Zhou, B., Curless, B., Kemelmacher-Shlizerman, I., Holynski, A., Seitz, S.M.: Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In: International Conference on Learning Representations (ICLR) (2025),https://openreview.net/forum?id=ykD8a9gJvy
2025
-
[45]
Wang, Y., Wang, X., Chen, Z., Wang, Z., Sun, F., Zhu, J.: Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian surfels (2024).https: //doi.org/10.48550/arXiv.2405.16822,https://arxiv.org/abs/2405.16822 30 H. Nishizawa et al
-
[46]
arXiv preprint arXiv:2602.22742 (2026)
Watanabe, A., Yu, Q., Simo-Serra, E., Fujiwara, K.: Projflow: Projection sampling with flow matching for zero-shot exact spatial motion control. arXiv preprint arXiv:2602.22742 (2026)
arXiv 2026
-
[47]
Watson, D., Saxena, S., Li, L., Tagliasacchi, A., Fleet, D.J.: Controlling space and time with diffusion models (2024).https://doi.org/10.48550/arXiv.2407.07860, https://arxiv.org/abs/2407.07860
-
[48]
In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)
Wei, D., Sun, X., Sun, H., Li, B., Hu, S., Li, W., Lu, J.: Diffkfc: Keyframes- collaborated diffusion for human motion synthesis with text control. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)
2024
-
[49]
Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024).https: //doi.org/10.48550/arXiv.2310.08528,https://arxiv.org/abs/2310.08528
-
[50]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4d: Create anything in 4d with multi-view video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (June 2025),https://openaccess.thecvf. com/content/CVPR2025/html/Wu_CAT4D_Create_Anything_in_4D...
2025
-
[51]
In: Computer Vision – ECCV 2024
Wu, Z., Yu, C., Jiang, Y., Cao, C., Wang, F., Bai, X.: Sc4d: Sparse-controlled video-to-4d generation and motion transfer. In: Computer Vision – ECCV 2024. pp. 361–379 (2024).https://doi.org/10.1007/978-3-031-72624-8_21
-
[52]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Wu, Z., Yu, C., Wang, F., Bai, X.: Animateanymesh: A feed-forward 4d foundation model for text-driven universal mesh animation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13557–13568 (October 2025)
2025
-
[53]
In: International Conference on Learning Representations (ICLR) (2024),https://neu-vi.github.io/omnicontrol/
Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: Omnicontrol: Control any joint at any time for human motion generation. In: International Conference on Learning Representations (ICLR) (2024),https://neu-vi.github.io/omnicontrol/
2024
-
[54]
arXiv preprint arXiv:2312.17225 (2023)
Yin, Y., Xu, D., Wang, Z., Zhao, Y., Wei, Y.: 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225 (2023)
arXiv 2023
-
[55]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Yun, K., Hong, S., Kim, C., Noh, J.: Anymole: Any character motion in-betweening leveraging video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 27838–27848 (June 2025), https://openaccess.thecvf.com/content/CVPR2025/html/Yun_AnyMoLe_ Any_Character_Motion_In-betweening_Leveraging_Video_D...
2025
-
[56]
In: Computer Vision – ECCV 2024
Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15094, pp. 163–179. Springer (2024). https://doi.org/10.1007/978-3-031-72764-1_10 , https://www.ecva. net/papers/eccv_2024/papers_ECCV/html/528...
-
[57]
Continual reinforcement learning with multi-timescale replay (2020)
Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view video diffusion model for 4d generation (2024).https://doi.org/10.48550/arXiv. 2405.20674,https://arxiv.org/abs/2405.20674
work page internal anchor Pith review doi:10.48550/arxiv 2024
-
[58]
Advances in Neural Information Processing Systems38, 30546–30566 (2026)
Zhang, L., Cai, S., Li, M., Wetzstein, G., Agrawala, M.: Frame context packing and drift prevention in next-frame-prediction video diffusion models. Advances in Neural Information Processing Systems38, 30546–30566 (2026)
2026
-
[59]
Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model (2022).https://doi. org/10.48550/arXiv.2208.15001,https://arxiv.org/abs/2208.15001 Feed-forward Motion In-betweening for Any 4D 31
-
[60]
arXiv preprint arXiv:2311.14603 (2023)
Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)
arXiv 2023
-
[61]
Zheng, B., Chen, K., Yao, Y., Zeng, Z., Jiang, X., Wang, H., Lasenby, J., Jin, X.: Autokeyframe: Autoregressive keyframe generation for human motion synthesis and editing. In: ACM SIGGRAPH Conference Proceedings (2025).https://doi.org/ 10.1145/3721238.3730664, https://dl.acm.org/doi/10.1145/3721238.3730664
-
[62]
Zhu, H., He, T., Yu, X., Guo, J., Chen, Z., Bian, J.: Ar4d: Autoregressive 4d generation from monocular videos. arXiv preprint arXiv:2501.01722 (2025).https: //doi.org/10.48550/arXiv.2501.01722,https://arxiv.org/abs/2501.01722
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.