
arxiv: 2606.22131 · v1 · pith:M4HHNWZ2new · submitted 2026-06-20 · 💻 cs.CV

Feed-forward Motion In-betweening for Any 4D

Pith reviewed 2026-06-26 12:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D mesh generation · motion in-betweening · feed-forward generation · keyframe conditioning · rectified flow · mesh VAE · arbitrary topology

The pith

A feed-forward framework generates arbitrary 4D mesh sequences by conditioning a rectified flow model on sparse keyframes via universal latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to synthesize the missing frames in 4D mesh animations (changing 3D shapes over time) when only a few keyframes are supplied. It first trains a frame-wise VAE to turn each mesh into latent tokens that ignore topology differences by anchoring everything to one reference mesh. These tokens then drive a rectified flow model whose MMDiT backbone fills in the in-between frames in one fast pass. The goal is to replace slow test-time optimization or distillation steps while giving users direct control over long sequences through the chosen keyframes.

Core claim

The paper claims that universal mesh-animation latents, produced by a frame-wise mesh VAE that outputs topology-agnostic tokens anchored to a reference mesh, enable a keyframe-conditioned rectified flow model with MMDiT backbone to synthesize non-keyframe frames for arbitrary 4D meshes in a single feed-forward step, yielding strong performance and improved controllability on DyMesh16 and DyMesh32 benchmarks.

What carries the argument

Frame-wise mesh VAE producing topology-agnostic latent tokens anchored by a reference mesh, paired with a keyframe-conditioned rectified flow model using MMDiT backbone.
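
For readers new to rectified flow, the standard formulation from Liu et al. [27] can be written as below. This is a sketch of the generic objective, not the paper's exact loss, which this page does not reproduce. The symbol $c$ (keyframe and text conditioning) and the use of $\tau$ for flow time, chosen to avoid a clash with the frame index $t$, are notation assumed here.

```latex
% Straight-line interpolation between noise x_0 and data x_1
x_\tau = (1-\tau)\,x_0 + \tau\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\mathrm{data}},\quad \tau \in [0,1]

% Velocity-matching objective: regress the constant displacement x_1 - x_0
\mathcal{L}(\theta) = \mathbb{E}_{\tau,\,x_0,\,x_1}\,\bigl\| v_\theta(x_\tau, \tau, c) - (x_1 - x_0) \bigr\|_2^2

% Sampling: integrate the learned ODE from \tau = 0 (noise) to \tau = 1 (data)
\frac{\mathrm{d}x_\tau}{\mathrm{d}\tau} = v_\theta(x_\tau, \tau, c)
```

Because the regression target is a constant velocity along a straight path, few integration steps are typically needed at sampling time, which is consistent with the fast inference the summary describes.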

If this is right

  • Long-horizon 4D sequences can be produced without error accumulation from short-horizon generation.
  • Users gain direct spatiotemporal control by editing only the supplied keyframes.
  • Inference becomes fast enough for practical use in animation and games without per-sequence optimization.
  • The same latent space supports meshes of varying shapes and connectivities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent construction might transfer to other dynamic 3D representations such as point clouds or neural fields if the anchoring step generalizes.
  • Real-time interactive 4D editing tools could become feasible if the flow model supports incremental keyframe updates.
  • World-modeling pipelines that currently rely on video diffusion priors could switch to this feed-forward route for faster rollout.

Load-bearing premise

Universal mesh-animation latents exist and a frame-wise VAE can reliably encode arbitrary topologies into consistent tokens anchored by a reference mesh for conditioning.

What would settle it

Evaluate the model on DyMesh32 test sequences. The claim fails if the generated in-between frames diverge from ground truth more than baseline methods do on standard geometric and temporal metrics, or if controllability metrics show no gain over prior feed-forward generators.

Figures

Figures reproduced from arXiv: 2606.22131 by Hirokatsu Kataoka, Hiroki Nishizawa, Hubert P. H. Shum, Shigeo Morishima, Yoshihiro Fukuhara.

Figure 1: We introduce CompletionAny4D, the first feed-forward motion in-betweening method for any 4D mesh. It completes 4D scenes, objects, animals, and characters from sparse keyframes and text prompts in a few seconds.
Figure 2: Method overview of our CompletionAny4D, a framework of frame-wise latent encoding and generation for motion in-betweening.
Excerpt (Section 3.2, The CompletionAny4D Framework): The CompletionAny4D mesh in-betweening pipeline is illustrated in …
Figure 3: The architecture of our frame-wise mesh VAE.
Excerpt: … reference mesh $M_0$, we extract a shape structure token sequence $\hat{V}^{n}_{0}$ that is shared across the entire sequence. For each timestep $t$, we encode the mesh frame $M_t$ into a per-frame motion token sequence $\hat{V}_t$, and we apply FPS (Farthest Point Sampling) to reduce the vertex count to $n$. Our latents consist of the full-vertex shape structure latent $\hat{V}_0$ and reduced vertex…
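
The Figure 3 text mentions FPS (Farthest Point Sampling) for reducing the vertex count to n. As a minimal sketch of the generic greedy algorithm, assuming Euclidean distance and a fixed starting vertex (the paper's actual settings are not given in this excerpt, and the function name is hypothetical):

```python
import math

def farthest_point_sampling(points, n, start=0):
    """Return indices of n points chosen greedily to be mutually far apart."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

# Toy vertex set (illustrative values only).
vertices = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.5, 0.5, 0.0),
]
print(farthest_point_sampling(vertices, 3))  # [0, 2, 3]
```

The near-duplicate vertex at index 1 is skipped in favor of well-separated ones, which is why FPS is a common choice for downsampling while preserving overall shape coverage.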
Figure 4: The architecture of our keyframe-conditioned frame-wise rectified flow model.
Excerpt (Proposed Keyframe Encoding and Masking): Replacing keyframes only at sampling time provides weak motion control and often yields unnatural motions. We therefore develop a CondMDI-style masked latent replacement scheme for our frame-wise rectified flow model. Let $m \in \{0, 1\}^{T}$ be a binary mask over frame indices $i \in \{1, \dots, T\}$, …
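
The binary frame mask described in the Figure 4 text can be illustrated with a small sketch. It shows only the element-wise replacement (keyframe latent where the mask is 1, current latent elsewhere). How the paper feeds the mask to the model and applies the scheme during training is not shown in this excerpt, so the function name and data layout below are assumptions.

```python
def apply_keyframe_mask(latents, keyframe_latents, mask):
    """Overwrite frames where mask[i] == 1 with the keyframe latent; keep the rest."""
    if not (len(latents) == len(keyframe_latents) == len(mask)):
        raise ValueError("all inputs must have one entry per frame")
    return [
        list(key) if m == 1 else list(cur)
        for cur, key, m in zip(latents, keyframe_latents, mask)
    ]

T = 4
mask = [1, 0, 0, 1]  # first and last frames are keyframes (illustrative)
noisy = [[0.5, -0.5] for _ in range(T)]  # stand-in for noisy latents
keys = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [2.0, 2.0]]  # only masked rows are used
mixed = apply_keyframe_mask(noisy, keys, mask)
print(mixed)  # [[1.0, 1.0], [0.5, -0.5], [0.5, -0.5], [2.0, 2.0]]
```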
Figure 5: A qualitative result on the DyMesh16 dataset. We sampled 7 frames from the non-keyframes; the orange meshes are the keyframes on which each method is conditioned.
Excerpt (Keyframe conditioning for long horizons): In real-world applications, long-horizon motion generation is often more important than short-horizon synthesis. We therefore evaluate whether our 16-frame keyframe-conditioned frame-wise Rectif…
Figure 6: A qualitative result on the DyMesh32 dataset with AR generation by AnimateAnyMesh [52] and motion in-betweening. For visualization, we sampled 6 frames from the non-keyframes; the orange meshes are used as the keyframes for each method and stage.
Excerpt: We next evaluate motion controllability and generation quality using quantitative metrics. As shown in …
Figure 7: Representative frames from the six fixed orthographic views used for VBench evaluation: front, right side, back, left side, top, and bottom. All methods are rendered with the same camera, lighting, background, material, and frame rate settings. VBench is computed on the six rendered videos from these viewpoints, and the reported score is the mean across views.
Excerpt: C.2 Human evaluation. User Study Protocol and S…
Figure 8: Representative snapshots used in the user study. This figure should match the actual evaluation interface and examples.
Excerpt (D Training and Implementation Details, D.1 Data and model variants, Dataset Curation and Split): We train on the DyMesh16 subset used in the main paper. We restrict the vertex count to 512–4096, resulting in approximately 260K samples in total with an 80/20 train/validation split. The main pa…
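
The curation rule quoted in the Figure 8 text (vertex count restricted to 512–4096, then an 80/20 train/validation split) could look like the sketch below. Inclusive bounds, a seeded random shuffle, and the `num_vertices` field name are assumptions made for illustration; the excerpt does not say how the split was drawn.

```python
import random

def curate_and_split(samples, min_v=512, max_v=4096, train_frac=0.8, seed=0):
    """Keep samples whose vertex count is in [min_v, max_v], then split them."""
    kept = [s for s in samples if min_v <= s["num_vertices"] <= max_v]
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(kept)
    cut = int(len(kept) * train_frac)
    return kept[:cut], kept[cut:]

# Toy metadata records (illustrative values only).
counts = [100, 512, 1000, 2048, 4096, 5000, 3000, 800, 600, 4000, 700, 900]
samples = [{"id": i, "num_vertices": v} for i, v in enumerate(counts)]
train, val = curate_and_split(samples)
print(len(train), len(val))  # 8 2
```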
Figure 9: Illustration of the autoregressive 31-frame rollout on the DyMesh32 dataset using AnimateAnyMesh [52]. For clarity, text prompts are omitted; the figure focuses on how frames are normalized and denormalized at each stage and how the stage outputs are stitched together. Text prompts are specified independently for each stage.
Excerpt: 3. Normalize the stage-2 local input using $c_1$ and $s_1$. 4. Replace the first local s…
Figure 10: Illustration of the 31-frame longer-horizon generation on the DyMesh32 dataset using CompletionAny4D. For clarity, text prompts are omitted; the figure focuses on how frames are normalized and denormalized at each stage and how the stage outputs are stitched together. Text prompts are specified independently for each stage.
Excerpt: This normalization/de-normalization flow and the final stage stitching are illustr…
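
The normalize/denormalize steps referred to in Figures 9 and 10 can be sketched as an invertible affine map. The excerpt names a center and scale for stage 1 but does not define how they are computed, so the bounding-box convention and all function names below are assumptions for illustration.

```python
def bounding_box_stats(points):
    """Center and half-extent of the axis-aligned bounding box (assumed convention)."""
    mins = [min(p[k] for p in points) for k in range(3)]
    maxs = [max(p[k] for p in points) for k in range(3)]
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    # Fall back to 1.0 for a degenerate (single-point) box to avoid division by zero.
    scale = max(hi - lo for lo, hi in zip(mins, maxs)) / 2 or 1.0
    return center, scale

def normalize(points, center, scale):
    """Map world-space vertices into the local frame."""
    return [[(x - c) / scale for x, c in zip(p, center)] for p in points]

def denormalize(points, center, scale):
    """Map local-frame vertices back to world space."""
    return [[x * scale + c for x, c in zip(p, center)] for p in points]

frame = [[2, 0, 0], [4, 2, 0], [2, 2, 2]]  # toy vertices
c1, s1 = bounding_box_stats(frame)
local = normalize(frame, c1, s1)
restored = denormalize(local, c1, s1)
print(c1, s1)
print(local)
```

Reusing one stage's center and scale for the next stage's input, as step 3 in the Figure 9 excerpt describes, keeps consecutive stages in a shared coordinate frame so their outputs can be stitched after denormalization.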
Figure 11: 5-point Likert score distributions (short-horizon user study).
Figure 12: 5-point Likert score distributions (long-horizon user study).
Abstract

4D dynamics (3D geometry evolving over time) is a fundamental representation of the physical world and plays a crucial role in world modeling (e.g., animation and games). Owing to the scarcity of large-scale, long-horizon 4D mesh data with arbitrary shapes, early text-to-4D methods rely on distillation or test-time optimization from video diffusion priors, making inference prohibitively slow. Recent feed-forward generators greatly reduce inference cost but offer limited spatiotemporal controllability, and short-horizon generation often leads to error accumulation in long-horizon sequences. We propose a novel feed-forward in-betweening framework for arbitrary 4D meshes with keyframe conditioning. Building on universal mesh-animation latents, we introduce a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh for keyframe conditioning. We further introduce a keyframe-conditioned rectified flow model with an MMDiT backbone that synthesizes non-keyframe frames conditioned on sparse keyframes. Experiments show strong performance and improved controllability on both DyMesh16 and DyMesh32 benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a feed-forward motion in-betweening framework for arbitrary 4D meshes with keyframe conditioning. It builds on universal mesh-animation latents via a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh. A keyframe-conditioned rectified flow model with an MMDiT backbone then synthesizes the non-keyframe frames. The central claim is that this yields strong performance and improved controllability on the DyMesh16 and DyMesh32 benchmarks.

Significance. If the performance claims hold under quantitative scrutiny, the work would advance feed-forward 4D mesh generation by providing controllable long-horizon synthesis without distillation or test-time optimization, addressing a key limitation in current text-to-4D and animation pipelines.

major comments (1)
  1. [Abstract] The assertion that 'Experiments show strong performance and improved controllability' is load-bearing for the paper's contribution yet supplies no quantitative metrics, error bars, ablation tables, or baseline implementation details, preventing verification of the claim.
minor comments (2)
  1. [Abstract] The term 'MMDiT backbone' is introduced without expansion or reference, reducing accessibility for readers outside the immediate subfield.
  2. [Abstract] The DyMesh16 and DyMesh32 benchmarks are named without any characterization of their content, sequence length, or topology diversity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the point below.

  1. Referee: [Abstract] The assertion that 'Experiments show strong performance and improved controllability' is load-bearing for the paper's contribution yet supplies no quantitative metrics, error bars, ablation tables, or baseline implementation details, preventing verification of the claim.

    Authors: We agree that the abstract would benefit from concrete quantitative support for the performance claims. While the full manuscript already contains detailed results with metrics, error bars, ablation tables, and baseline comparisons in Section 4 and the supplementary material, the abstract itself is currently high-level. In the revision we will update the abstract to include key quantitative results from the DyMesh16 and DyMesh32 benchmarks (e.g., primary error metrics and relative improvements), thereby making the central claim more verifiable without exceeding typical abstract length constraints.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The provided abstract and description contain no equations, derivations, or loss formulations. The method is explicitly described as building on external 'universal mesh-animation latents' rather than defining success metrics or latents in terms of its own outputs. Experiments are reported on independent DyMesh16/DyMesh32 benchmarks. No self-citation chains, fitted inputs renamed as predictions, or self-definitional reductions are present in the given text, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the existence and utility of universal mesh-animation latents plus the effectiveness of the newly introduced VAE and flow components; no independent evidence for these is supplied in the abstract.

axioms (1)
  • [domain assumption] Universal mesh-animation latents exist that allow topology-agnostic encoding of arbitrary 4D meshes.
    The framework is built on these latents for keyframe conditioning.
invented entities (2)
  • frame-wise mesh VAE (no independent evidence)
    purpose: Encodes each frame into topology-agnostic latent tokens anchored by a reference mesh
    Introduced as the encoding component of the framework.
  • keyframe-conditioned rectified flow model with MMDiT backbone (no independent evidence)
    purpose: Synthesizes non-keyframe frames from sparse keyframes
    Introduced as the generative component of the framework.

pith-pipeline@v0.9.1-grok · 5739 in / 1391 out tokens · 39716 ms · 2026-06-26T12:19:04.286815+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 26 canonical work pages · 3 internal anchors

  1. [1]


    Bae, H., Oh, G., Lee, K., Kim, S., Shin, H., Lee, K.M.: Less is more: Improving motion diffusion models with sparse keyframes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7902–7913 (October 2025)

  2. [2]

    Bahmani, S., Liu, X., Yifan, W., Skorokhodov, I., Rong, V., Liu, Z., Liu, X., Park, J.J., Tulyakov, S., Wetzstein, G., Tagliasacchi, A., Lindell, D.B.: TC4D: Trajectory-conditioned text-to-4d generation. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15104, pp. 53–72. Springer (2024). https://doi.org/10.1007/978-3-031-72952-2_4

  3. [3]

    Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7996–8006 (June 2024), https://openaccess.thecvf...

  4. [5]

    Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., VanderBilt, E., Kembhavi, A., Vondrick, C., Gkioxari, G., Ehsani, K., Schmidt, L., Farhadi, A.: Objaverse-xl: A universe of 10m+ 3d objects (2023). https://doi.org/10.48550/arXiv.2307.05663, https://arxiv.org/abs/2307.05663

  5. [6]

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects (2022). https://doi.org/10.48550/arXiv.2212.08051, https://arxiv.org/abs/2212.08051

  6. [7]


    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)

  7. [8]

    Geng, Z., Han, C., Hayder, Z., Liu, J., Shah, M., Mian, A.: Text-guided 3d human motion generation with keyframe-based parallel skip transformer (2024). https://doi.org/10.48550/arXiv.2405.15439, https://arxiv.org/abs/2405.15439

  8. [9]

    Goel, P., Tevet, G., Liu, C.K., Fatahalian, K.: Generating detailed character motion from blocking poses. In: ACM SIGGRAPH Asia Conference Papers (2025). https://doi.org/10.1145/3757377.3763874, https://dl.acm.org/doi/10.1145/3757377.3763874

  9. [10]

    Goel, P., Zhang, H., Liu, C.K., Fatahalian, K.: Generative motion infilling from imprecisely timed keyframes. Computer Graphics Forum (2025). https://doi.org/10.1111/cgf.70060

  10. [11]

    Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions (2023). https://doi.org/10.48550/arXiv.2312.00063, https://arxiv.org/abs/2312.00063

  11. [12]

    Harvey, F.G., Yurick, M., Nowrouzezahrai, D., Pal, C.: Robust motion in-betweening. ACM Transactions on Graphics 39(4) (July 2020). https://doi.org/10.1145/3386569.3392480

  12. [13]

    Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F.: Flexible diffusion modeling of long videos. In: Advances in Neural Information Processing Systems (2022). https://doi.org/10.48550/arXiv.2205.11495, https://arxiv.org/abs/2205.11495

  13. [14]

    Hong, S., Kim, H., Cho, K., Noh, J.: Long-term motion in-betweening via keyframe prediction. Computer Graphics Forum (2024). https://doi.org/10.1111/cgf.15171

  14. [15]

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21807–21818 (June 2024), https://o...

  15. [16]

    Jiang, Y., Yu, C., Cao, C., Wang, F., Hu, W., Gao, J.: Animate3d: Animating any 3d model with multi-view video diffusion (2024). https://doi.org/10.48550/arXiv.2406.11216, https://arxiv.org/abs/2406.11216

  16. [17]

    Jiang, Y., Zhang, L., Gao, J., Hu, W., Yao, Y.: Consistent4d: Consistent 360° dynamic object generation from monocular video. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=sPUrdFGepF

  17. [18]

    Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis (2023). https://doi.org/10.48550/arXiv.2305.12577, https://arxiv.org/abs/2305.12577

  18. [19]

    Kaufmann, M., Aksan, E., Song, J., Pece, F., Ziegler, R., Hilliges, O.: Convolutional autoencoders for human motion infilling. In: International Conference on 3D Vision (3DV). pp. 918–927 (2020). https://doi.org/10.48550/arXiv.2010.11531

  19. [20]

    Kim, J., Byun, T., Shin, S., Won, J., Choi, S.: Conditional motion in-betweening. Pattern Recognition 132, 108894 (December 2022). https://doi.org/10.1016/j.patcog.2022.108894

  20. [21]

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015), https://arxiv.org/abs/1412.6980

  21. [22]

    Li, Y., Takehara, H., Taketomi, T., Zheng, B., Nießner, M.: 4dcomplete: Non-rigid motion estimation beyond the observable surface. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12706–12716 (2021), https://openaccess.thecvf.com/content/ICCV2021/html/Li_4DComplete_Non-Rigid_Motion_Estimation_Beyond_the_Observable_S...

  22. [23]


    Li, Z., Zhang, M., Wu, T., Tan, J., Wang, J., Lin, D.: Ss4d: Native 4d generative model via structured spacetime latents. arXiv preprint arXiv:2512.14284 (2025)

  23. [24]

    Li, Z., Chen, Y., Liu, P.: Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian-mesh hybrid representation. In: Advances in Neural Information Processing Systems (2024)

  24. [25]

    Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. arXiv preprint arXiv:2312.13763 (2023)

  25. [26]

    Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8576–8588 (June 2024), https://openaccess.thecvf.com/content/CVPR2024/html/Ling_Align_Your_Gaussians_Text-to-...

  26. [27]


    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  27. [28]

    Lyu, Y., Geng, C., Dharmarajan, K., Zhang, Y., Alzayer, H., Wu, S., Wu, J.: Choreographing a world of dynamic objects. arXiv preprint arXiv:2601.04194 (2026). https://doi.org/10.48550/arXiv.2601.04194, https://arxiv.org/abs/2601.04194

  28. [29]

    Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes (2019). https://doi.org/10.48550/arXiv.1904.03278, https://arxiv.org/abs/1904.03278

  29. [30]


    Nag, S., Cohen-Or, D., Zhang, H., Mahdavi-Amiri, A.: In-2-4d: Inbetweening from two single-view images to 4d generation. arXiv preprint arXiv:2504.08366 (2025). https://doi.org/10.48550/arXiv.2504.08366

  30. [31]

    Oreshkin, B.N., Valkanas, A., Harvey, F.G., Ménard, L.S., Bocquelet, F., Coates, M.J.: Motion inbetweening via deep δ-interpolator. arXiv preprint arXiv:2201.06701 (2022). https://doi.org/10.48550/arXiv.2201.06701

  31. [32]


    Pan, Z., Yang, Z., Zhu, X., Zhang, L.: Fast dynamic 3d object generation from a single-view video. arXiv preprint arXiv:2401.08742 (2024)

  32. [33]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195–4205 (October 2023)

  33. [34]

    Qin, J.: Scalable motion in-betweening via diffusion and physics-based character adaptation. arXiv preprint arXiv:2504.09413 (2025). https://doi.org/10.48550/arXiv.2504.09413

  34. [35]

    Qin, J., Yan, P., An, B.: Robust diffusion-based motion in-betweening. Computer Graphics Forum (2024). https://doi.org/10.1111/cgf.15260

  35. [36]

    Qin, J., Zheng, Y., Zhou, K.: Motion in-betweening via two-stage transformers. ACM Transactions on Graphics 41(6) (December 2022). https://doi.org/10.1145/3550454.3555454

  36. [37]


    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Resea...

  37. [38]

    Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023). https://doi.org/10.48550/arXiv.2312.17142, https://arxiv.org/abs/2312.17142

  38. [39]

    Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Torralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruction model. In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https://proceedings.neurips.cc/paper_files/paper/2024/hash/6808f2c57d9564a2639a4710e3bbd9b9-Abst...

  39. [40]


    Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4D dynamic scene generation. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine L...

  40. [41]

    Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4d dynamic scene generation. In: Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 31915–31929. PMLR (Jul 2023), https://proceedings.m...

  41. [42]

    Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3d generation (2024). https://doi.org/10.48550/arXiv.2402.05054, https://arxiv.org/abs/2402.05054

  42. [43]

    Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model (2022). https://doi.org/10.48550/arXiv.2209.14916, https://arxiv.org/abs/2209.14916

  43. [44]

    Wang, X., Zhou, B., Curless, B., Kemelmacher-Shlizerman, I., Holynski, A., Seitz, S.M.: Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In: International Conference on Learning Representations (ICLR) (2025), https://openreview.net/forum?id=ykD8a9gJvy

  44. [45]

    Wang, Y., Wang, X., Chen, Z., Wang, Z., Sun, F., Zhu, J.: Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian surfels (2024). https://doi.org/10.48550/arXiv.2405.16822, https://arxiv.org/abs/2405.16822

  45. [46]


    Watanabe, A., Yu, Q., Simo-Serra, E., Fujiwara, K.: Projflow: Projection sampling with flow matching for zero-shot exact spatial motion control. arXiv preprint arXiv:2602.22742 (2026)

  46. [47]

    Watson, D., Saxena, S., Li, L., Tagliasacchi, A., Fleet, D.J.: Controlling space and time with diffusion models (2024). https://doi.org/10.48550/arXiv.2407.07860, https://arxiv.org/abs/2407.07860

  47. [48]

    Wei, D., Sun, X., Sun, H., Li, B., Hu, S., Li, W., Lu, J.: Diffkfc: Keyframes-collaborated diffusion for human motion synthesis with text control. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

  48. [49]

    Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024). https://doi.org/10.48550/arXiv.2310.08528, https://arxiv.org/abs/2310.08528

  49. [50]

    Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4d: Create anything in 4d with multi-view video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (June 2025), https://openaccess.thecvf.com/content/CVPR2025/html/Wu_CAT4D_Create_Anything_in_4D...

  50. [51]

    Wu, Z., Yu, C., Jiang, Y., Cao, C., Wang, F., Bai, X.: Sc4d: Sparse-controlled video-to-4d generation and motion transfer. In: Computer Vision – ECCV 2024. pp. 361–379 (2024). https://doi.org/10.1007/978-3-031-72624-8_21

  51. [52]


    Wu, Z., Yu, C., Wang, F., Bai, X.: Animateanymesh: A feed-forward 4d foundation model for text-driven universal mesh animation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13557–13568 (October 2025)

  52. [53]

    Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: Omnicontrol: Control any joint at any time for human motion generation. In: International Conference on Learning Representations (ICLR) (2024), https://neu-vi.github.io/omnicontrol/

  53. [54]


    Yin, Y., Xu, D., Wang, Z., Zhao, Y., Wei, Y.: 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225 (2023)

  54. [55]

    Yun, K., Hong, S., Kim, C., Noh, J.: Anymole: Any character motion in-betweening leveraging video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 27838–27848 (June 2025), https://openaccess.thecvf.com/content/CVPR2025/html/Yun_AnyMoLe_Any_Character_Motion_In-betweening_Leveraging_Video_D...

  55. [56]

    Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15094, pp. 163–179. Springer (2024). https://doi.org/10.1007/978-3-031-72764-1_10, https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/528...

  56. [57]

    Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view video diffusion model for 4d generation (2024). https://doi.org/10.48550/arXiv.2405.20674, https://arxiv.org/abs/2405.20674

  57. [58]

    Zhang, L., Cai, S., Li, M., Wetzstein, G., Agrawala, M.: Frame context packing and drift prevention in next-frame-prediction video diffusion models. Advances in Neural Information Processing Systems 38, 30546–30566 (2026)

  58. [59]

    Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model (2022). https://doi.org/10.48550/arXiv.2208.15001, https://arxiv.org/abs/2208.15001

  59. [60]


    Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)

  60. [61]

    Zheng, B., Chen, K., Yao, Y., Zeng, Z., Jiang, X., Wang, H., Lasenby, J., Jin, X.: Autokeyframe: Autoregressive keyframe generation for human motion synthesis and editing. In: ACM SIGGRAPH Conference Proceedings (2025). https://doi.org/10.1145/3721238.3730664, https://dl.acm.org/doi/10.1145/3721238.3730664

  61. [62]

    Zhu, H., He, T., Yu, X., Guo, J., Chen, Z., Bian, J.: Ar4d: Autoregressive 4d generation from monocular videos. arXiv preprint arXiv:2501.01722 (2025). https://doi.org/10.48550/arXiv.2501.01722, https://arxiv.org/abs/2501.01722