Feed-forward Motion In-betweening for Any 4D

Hirokatsu Kataoka; Hiroki Nishizawa; Hubert P. H. Shum; Shigeo Morishima; Yoshihiro Fukuhara

arxiv: 2606.22131 · v1 · pith:M4HHNWZ2new · submitted 2026-06-20 · 💻 cs.CV

Feed-forward Motion In-betweening for Any 4D

Hiroki Nishizawa , Hubert P. H. Shum , Yoshihiro Fukuhara , Hirokatsu Kataoka , Shigeo Morishima This is my paper

Pith reviewed 2026-06-26 12:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D mesh generationmotion in-betweeningfeed-forward generationkeyframe conditioningrectified flowmesh VAEarbitrary topology

0 comments

The pith

A feed-forward framework generates arbitrary 4D mesh sequences by conditioning a rectified flow model on sparse keyframes via universal latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to synthesize the missing frames in 4D mesh animations (changing 3D shapes over time) when only a few keyframes are supplied. It first trains a frame-wise VAE to turn each mesh into latent tokens that ignore topology differences by anchoring everything to one reference mesh. These tokens then drive a rectified flow model whose MMDiT backbone fills in the in-between frames in one fast pass. The goal is to replace slow test-time optimization or distillation steps while giving users direct control over long sequences through the chosen keyframes.

Core claim

The paper claims that universal mesh-animation latents, produced by a frame-wise mesh VAE that outputs topology-agnostic tokens anchored to a reference mesh, enable a keyframe-conditioned rectified flow model with MMDiT backbone to synthesize non-keyframe frames for arbitrary 4D meshes in a single feed-forward step, yielding strong performance and improved controllability on DyMesh16 and DyMesh32 benchmarks.

What carries the argument

Frame-wise mesh VAE producing topology-agnostic latent tokens anchored by a reference mesh, paired with a keyframe-conditioned rectified flow model using MMDiT backbone.

If this is right

Long-horizon 4D sequences can be produced without error accumulation from short-horizon generation.
Users gain direct spatiotemporal control by editing only the supplied keyframes.
Inference becomes fast enough for practical use in animation and games without per-sequence optimization.
The same latent space supports meshes of varying shapes and connectivities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent construction might transfer to other dynamic 3D representations such as point clouds or neural fields if the anchoring step generalizes.
Real-time interactive 4D editing tools could become feasible if the flow model supports incremental keyframe updates.
World-modeling pipelines that currently rely on video diffusion priors could switch to this feed-forward route for faster rollout.

Load-bearing premise

Universal mesh-animation latents exist and a frame-wise VAE can reliably encode arbitrary topologies into consistent tokens anchored by a reference mesh for conditioning.

What would settle it

Run the model on DyMesh32 test sequences where the generated in-between frames diverge from ground truth by more than baseline methods on standard geometric and temporal metrics, or where controllability metrics show no gain over prior feed-forward generators.

Figures

Figures reproduced from arXiv: 2606.22131 by Hirokatsu Kataoka, Hiroki Nishizawa, Hubert P. H. Shum, Shigeo Morishima, Yoshihiro Fukuhara.

**Figure 1.** Figure 1: We introduce CompletionAny4D, the first feedforward motion in-betweenings for any 4D mesh, completes the 4D scenes, objects, animals, and characters from sparse keyframes and text prompts in a few seconds. Abstract. 4D dynamics (3D geometry evolving over time) is a fundamental representation of the physical world and plays a crucial role in world modeling (e.g., animation and games). Owing to the scarcity… view at source ↗

**Figure 2.** Figure 2: Method overview of our CompletionAny4D, a framework of frame-wise latent encoding and generation for motion in-betweening. 3.2 The CompletionAny4D Framework The CompletionAny4D mesh in-betweening pipeline is illustrated in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The architecture of our frame-wise mesh VAE reference mesh M0, we extract a shape structure token sequence Vb n 0 that is shared across the entire sequence. For each timestep t, we encode the mesh frame Mt into a per-frame motion token sequence Vb t and we do FPS(Farthest Point Sampling) to reduce the vertex counts to n. Our latents is consisted of full vertex shape structure latent Vb 0 and reduced vertex… view at source ↗

**Figure 4.** Figure 4: The architecture of our Keyframe-conditioned frame-wise Rectified Flow model. Proposed Keyframe Encoding and Masking. Replacing keyframes only at sampling time provides weak motion control and often yields unnatural motions. We therefore develop a CondMDI-style masked latent replacement scheme for our frame-wise rectified flow model. Let m ∈ {0, 1} T be a binary mask over frame indices i ∈ {1, . . . , T}, … view at source ↗

**Figure 5.** Figure 5: A qualitative result on Dymesh16 dataset. We sampled 7 frames from nonkeyframes, the orange colored meshes are used as the keyframes which are conditioned for each methods. Keyframe conditioning for long horizons. In real-world applications, longhorizon motion generation is often more important than short-horizon synthesis. We therefore evaluate whether our 16-frame keyframe-conditioned frame-wise Rectif… view at source ↗

**Figure 6.** Figure 6: A qualitative result on Dymesh32 dataset with AR generation by AnimateAnyMesh [52] and motion in-betweenings. For visualization, we sampled 6 frames from the non-keyframes, the orange colored meshes are used as the keyframes for each methods and stages. We next evaluate motion controllability and generation quality using quantitative metrics. As shown in [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Representative frames from the six fixed orthographic views used for VBench evaluation: front, right side, back, left side, top, and bottom. All methods are rendered with the same camera, lighting, background, material, and frame rate settings. VBench is computed on the six rendered videos from these viewpoints, and the reported score is the mean across views. C.2 Human evaluation User Study Protocol and S… view at source ↗

**Figure 8.** Figure 8: Representative snapshots used in the user study. This figure should match the actual evaluation interface and examples. D Training and Implementation Details D.1 Data and model variants Dataset Curation and Split We train on the DyMesh16 subset used in the main paper. We restrict the vertex count to 512–4096, resulting in approximately 260K samples in total with an 80/20 train/validation split. The main pa… view at source ↗

**Figure 9.** Figure 9: Illustration of the autoregressive 31-frame rollout on the DyMesh32 dataset using AnimateAnyMesh [52]. For clarity, text prompts are omitted; the figure focuses on how frames are normalized and denormalized at each stage and how the stage outputs are stitched together. Text prompts are specified independently for each stage. 3. Normalize the stage-2 local input using c1 and s1. 4. Replace the first local s… view at source ↗

**Figure 10.** Figure 10: Illustration of the 31-frame longer-horizon generation on the DyMesh32 dataset using CompletionAny4D. For clarity, text prompts are omitted; the figure focuses on how frames are normalized and denormalized at each stage and how the stage outputs are stitched together. Text prompts are specified independently for each stage. This normalization/de-normalization flow and the final stage stitching are illustr… view at source ↗

**Figure 11.** Figure 11: 5-point Likert score distributions (short-horizon user study). [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

**Figure 12.** Figure 12: 5-point Likert score distributions (long-horizon user study). [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗

read the original abstract

4D dynamics (3D geometry evolving over time) is a fundamental representation of the physical world and plays a crucial role in world modeling (e.g., animation and games). Owing to the scarcity of large-scale, long-horizon 4D mesh data with arbitrary shapes, early text-to-4D methods rely on distillation or test-time optimization from video diffusion priors, making inference prohibitively slow. Recent feed-forward generators greatly reduce inference cost but offer limited spatiotemporal controllability, and short-horizon generation often leads to error accumulation in long-horizon sequences. We propose a novel feed-forward in-betweening framework for arbitrary 4D meshes with keyframe conditioning. Building on universal mesh-animation latents, we introduce a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh for keyframe conditioning. We further introduce a keyframe-conditioned rectified flow model with an MMDiT backbone that synthesizes non-keyframe frames conditioned on sparse keyframes. Experiments show strong performance and improved controllability on both DyMesh16 and DyMesh32 benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a feed-forward 4D mesh in-betweening pipeline with a frame-wise VAE and MMDiT rectified flow but supplies no metrics or implementation details to check whether it works.

read the letter

The main thing to know is that this paper describes a feed-forward way to generate long 4D mesh sequences by conditioning a rectified flow model on sparse keyframes, after encoding frames with a topology-agnostic VAE anchored to a reference mesh. The combination is framed as new for arbitrary shapes.

It does a reasonable job stating the practical problem: distillation and optimization methods are slow, while existing feed-forward generators lack long-horizon control and accumulate errors. Targeting keyframe conditioning directly addresses a real need in animation pipelines.

The soft spots are clear and central. The abstract asserts strong performance and improved controllability on DyMesh16 and DyMesh32 but gives zero quantitative results, no ablation tables, no baseline implementation details, and no error analysis. Without those, there is no way to judge if the VAE actually delivers usable latents across topologies or if the flow model improves anything. The assumption that universal mesh-animation latents exist is stated without equations, loss terms, or verification.

This is for people working on 4D generation for games, simulation, or world models who need faster inference with explicit control. A reader who needs reproducible results or code will not get much from the abstract alone.

I would bring it to a reading group as maybe, mainly to see the full architecture and any hidden experiments. I would not cite it yet. It deserves peer review because the problem is relevant and the framing is straightforward, even if the evaluation section will need substantial work.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a feed-forward motion in-betweening framework for arbitrary 4D meshes with keyframe conditioning. It builds on universal mesh-animation latents via a frame-wise mesh VAE that encodes each frame into topology-agnostic latent tokens anchored by a reference mesh. A keyframe-conditioned rectified flow model with an MMDiT backbone then synthesizes the non-keyframe frames. The central claim is that this yields strong performance and improved controllability on the DyMesh16 and DyMesh32 benchmarks.

Significance. If the performance claims hold under quantitative scrutiny, the work would advance feed-forward 4D mesh generation by providing controllable long-horizon synthesis without distillation or test-time optimization, addressing a key limitation in current text-to-4D and animation pipelines.

major comments (1)

[Abstract] Abstract: The assertion that 'Experiments show strong performance and improved controllability' is load-bearing for the paper's contribution yet supplies no quantitative metrics, error bars, ablation tables, or baseline implementation details, preventing verification of the claim.

minor comments (2)

[Abstract] Abstract: The term 'MMDiT backbone' is introduced without expansion or reference, reducing accessibility for readers outside the immediate subfield.
[Abstract] Abstract: The DyMesh16 and DyMesh32 benchmarks are named without any characterization of their content, sequence length, or topology diversity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the point below.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that 'Experiments show strong performance and improved controllability' is load-bearing for the paper's contribution yet supplies no quantitative metrics, error bars, ablation tables, or baseline implementation details, preventing verification of the claim.

Authors: We agree that the abstract would benefit from concrete quantitative support for the performance claims. While the full manuscript already contains detailed results with metrics, error bars, ablation tables, and baseline comparisons in Section 4 and the supplementary material, the abstract itself is currently high-level. In the revision we will update the abstract to include key quantitative results from the DyMesh16 and DyMesh32 benchmarks (e.g., primary error metrics and relative improvements), thereby making the central claim more verifiable without exceeding typical abstract length constraints. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description contain no equations, derivations, or loss formulations. The method is explicitly described as building on external 'universal mesh-animation latents' rather than defining success metrics or latents in terms of its own outputs. Experiments are reported on independent DyMesh16/DyMesh32 benchmarks. No self-citation chains, fitted inputs renamed as predictions, or self-definitional reductions are present in the given text, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the existence and utility of universal mesh-animation latents plus the effectiveness of the newly introduced VAE and flow components; no independent evidence for these is supplied in the abstract.

axioms (1)

domain assumption Universal mesh-animation latents exist that allow topology-agnostic encoding of arbitrary 4D meshes.
The framework is built on these latents for keyframe conditioning.

invented entities (2)

frame-wise mesh VAE no independent evidence
purpose: Encodes each frame into topology-agnostic latent tokens anchored by a reference mesh
Introduced as the encoding component of the framework.
keyframe-conditioned rectified flow model with MMDiT backbone no independent evidence
purpose: Synthesizes non-keyframe frames from sparse keyframes
Introduced as the generative component of the framework.

pith-pipeline@v0.9.1-grok · 5739 in / 1391 out tokens · 39716 ms · 2026-06-26T12:19:04.286815+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 26 canonical work pages · 3 internal anchors

[1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Bae, H., Oh, G., Lee, K., Kim, S., Shin, H., Lee, K.M.: Less is more: Improving motion diffusion models with sparse keyframes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7902–7913 (October 2025)

2025
[2]

In: Computer Vision – ECCV 2024

Bahmani, S., Liu, X., Yifan, W., Skorokhodov, I., Rong, V., Liu, Z., Liu, X., Park, J.J., Tulyakov, S., Wetzstein, G., Tagliasacchi, A., Lindell, D.B.: TC4D: Trajectory-conditioned text-to-4d generation. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15104, pp. 53–72. Springer (2024).https: //doi.org/10.1007/978-3-031-72952-2_4

work page doi:10.1007/978-3-031-72952-2_4 2024
[3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4d-fy: Text-to-4d gener- ation using hybrid score distillation sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7996– 8006 (June 2024), https://openaccess.thecvf...

2024
[5]

Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., VanderBilt, E., Kembhavi, A., Vondrick, C., Gkioxari, G., Ehsani, K., Schmidt, L., Farhadi, A.: Objaverse-xl: A universe of 10m+ 3d objects (2023).https://doi.org/10.48550/arXiv.2307.05663, https: //arxiv.org/abs/2307.05663

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.05663 2023
[6]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects (2022). https://doi.org/10.48550/arXiv.2212.08051 , https://arxiv.org/abs/2212.08051

work page doi:10.48550/arxiv.2212.08051 2022
[7]

arXiv preprint arXiv:2403.03206 (2024)

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)

Pith/arXiv arXiv 2024
[8]

Geng, Z., Han, C., Hayder, Z., Liu, J., Shah, M., Mian, A.: Text-guided 3d human motion generation with keyframe-based parallel skip transformer (2024).https: //doi.org/10.48550/arXiv.2405.15439,https://arxiv.org/abs/2405.15439

work page doi:10.48550/arxiv.2405.15439 2024
[9]

In: ACM SIGGRAPH Asia Conference Papers (2025)

Goel, P., Tevet, G., Liu, C.K., Fatahalian, K.: Generating detailed character mo- tion from blocking poses. In: ACM SIGGRAPH Asia Conference Papers (2025). https://doi.org/10.1145/3757377.3763874 , https://dl.acm.org/doi/10. 1145/3757377.3763874

work page doi:10.1145/3757377.3763874 2025
[10]

Computer Graphics Forum (2025).https://doi.org/ 10.1111/cgf.70060

Goel, P., Zhang, H., Liu, C.K., Fatahalian, K.: Generative motion infilling from imprecisely timed keyframes. Computer Graphics Forum (2025).https://doi.org/ 10.1111/cgf.70060

work page doi:10.1111/cgf.70060 2025
[11]

00063,https://arxiv.org/abs/2312.00063

Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions (2023).https://doi.org/10.48550/arXiv.2312. 00063,https://arxiv.org/abs/2312.00063

work page doi:10.48550/arxiv.2312 2023
[12]

ACM Transactions on Graphics39(4) (July 2020).https://doi.org/10.1145/ 3386569.3392480

Harvey, F.G., Yurick, M., Nowrouzezahrai, D., Pal, C.: Robust motion in-betweening. ACM Transactions on Graphics39(4) (July 2020).https://doi.org/10.1145/ 3386569.3392480

arXiv 2020
[13]

Flexible diffusion model- ing of long videos.arXiv preprint arXiv:2205.11495, 2022

Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F.: Flexible diffusion modeling of long videos. In: Advances in Neural Information Processing Systems (2022). https://doi.org/10.48550/arXiv.2205.11495, https://arxiv.org/abs/ 2205.11495

work page doi:10.48550/arxiv.2205.11495 2022
[14]

Yenpure, S

Hong, S., Kim, H., Cho, K., Noh, J.: Long-term motion in-betweening via keyframe prediction. Computer Graphics Forum (2024).https://doi.org/10.1111/cgf. 15171

work page doi:10.1111/cgf 2024
[15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21807–21818 (June 2024),https://o...

2024
[16]

Nishizawa et al

Jiang, Y., Yu, C., Cao, C., Wang, F., Hu, W., Gao, J.: Animate3d: Animating any 3d model with multi-view video diffusion (2024).https://doi.org/10.48550/ arXiv.2406.11216,https://arxiv.org/abs/2406.11216 28 H. Nishizawa et al

arXiv 2024
[17]

In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum? id=sPUrdFGepF

Jiang, Y., Zhang, L., Gao, J., Hu, W., Yao, Y.: Consistent4d: Consistent 360° dynamic object generation from monocular video. In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum? id=sPUrdFGepF

2024
[18]

48550/arXiv.2305.12577,https://arxiv.org/abs/2305.12577

Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis (2023).https://doi.org/10. 48550/arXiv.2305.12577,https://arxiv.org/abs/2305.12577

arXiv 2023
[19]

In: International Conference on 3D Vision (3DV)

Kaufmann, M., Aksan, E., Song, J., Pece, F., Ziegler, R., Hilliges, O.: Convolutional autoencoders for human motion infilling. In: International Conference on 3D Vision (3DV). pp. 918–927 (2020).https://doi.org/10.48550/arXiv.2010.11531

work page doi:10.48550/arxiv.2010.11531 2020
[20]

Peters and S

Kim, J., Byun, T., Shin, S., Won, J., Choi, S.: Conditional motion in-betweening. Pattern Recognition132, 108894 (December 2022).https://doi.org/10.1016/j. patcog.2022.108894

work page doi:10.1016/j 2022
[21]

In: International Conference on Learning Representations (ICLR) (2015),https://arxiv.org/abs/ 1412.6980

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015),https://arxiv.org/abs/ 1412.6980

Pith/arXiv arXiv 2015
[22]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Li, Y., Takehara, H., Taketomi, T., Zheng, B., Nießner, M.: 4dcomplete: Non-rigid motion estimation beyond the observable surface. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12706–12716 (2021), https://openaccess.thecvf.com/content/ICCV2021/html/Li_4DComplete_Non- Rigid_Motion_Estimation_Beyond_the_Observable_S...

2021
[23]

arXiv preprint arXiv:2512.14284 (2025)

Li, Z., Zhang, M., Wu, T., Tan, J., Wang, J., Lin, D.: Ss4d: Native 4d generative model via structured spacetime latents. arXiv preprint arXiv:2512.14284 (2025)

arXiv 2025
[24]

In: Advances in Neural Information Processing Systems (2024)

Li, Z., Chen, Y., Liu, P.: Dreammesh4d: Video-to-4d generation with sparse- controlled gaussian-mesh hybrid representation. In: Advances in Neural Information Processing Systems (2024)

2024
[25]

arXiv preprint arXiv:2312.13763 (2023)

Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text- to-4d with dynamic 3d gaussians and composed diffusion models. arXiv preprint arXiv:2312.13763 (2023)

arXiv 2023
[26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text- to-4d with dynamic 3d gaussians and composed diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8576–8588 (June 2024),https://openaccess.thecvf.com/content/CVPR2024/ html/Ling_Align_Your_Gaussians_Text- to-...

2024
[27]

arXiv preprint arXiv:2209.03003 (2022)

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

Pith/arXiv arXiv 2022
[28]

Choreographing a world of dynamic objects, 2026

Lyu, Y., Geng, C., Dharmarajan, K., Zhang, Y., Alzayer, H., Wu, S., Wu, J.: Choreographing a world of dynamic objects. arXiv preprint arXiv:2601.04194 (2026). https://doi.org/10.48550/arXiv.2601.04194 , https://arxiv.org/abs/2601. 04194

work page doi:10.48550/arxiv.2601.04194 2026
[29]

Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes (2019).https://doi.org/10.48550/ arXiv.1904.03278,https://arxiv.org/abs/1904.03278

Pith/arXiv arXiv 2019
[30]

arXiv preprint arXiv:2504.08366 (2025)

Nag, S., Cohen-Or, D., Zhang, H., Mahdavi-Amiri, A.: In-2-4d: Inbetweening from two single-view images to 4d generation. arXiv preprint arXiv:2504.08366 (2025). https://doi.org/10.48550/arXiv.2504.08366

work page doi:10.48550/arxiv.2504.08366 2025
[31]

arXiv preprint arXiv:2201.06701 (2022).https://doi.org/10.48550/arXiv.2201.06701 Feed-forward Motion In-betweening for Any 4D 29

Oreshkin, B.N., Valkanas, A., Harvey, F.G., Ménard, L.S., Bocquelet, F., Coates, M.J.: Motion inbetweening via deepδ-interpolator. arXiv preprint arXiv:2201.06701 (2022).https://doi.org/10.48550/arXiv.2201.06701 Feed-forward Motion In-betweening for Any 4D 29

work page doi:10.48550/arxiv.2201.06701 2022
[32]

arXiv preprint arXiv:2401.08742 (2024)

Pan, Z., Yang, Z., Zhu, X., Zhang, L.: Fast dynamic 3d object generation from a single-view video. arXiv preprint arXiv:2401.08742 (2024)

arXiv 2024
[33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195– 4205 (October 2023)

2023
[34]

arXiv preprint arXiv:2504.09413 (2025).https://doi.org/10.48550/ arXiv.2504.09413

Qin, J.: Scalable motion in-betweening via diffusion and physics-based character adaptation. arXiv preprint arXiv:2504.09413 (2025).https://doi.org/10.48550/ arXiv.2504.09413

arXiv 2025
[35]

Computer Graphics Forum (2024).https://doi.org/10.1111/cgf.15260

Qin, J., Yan, P., An, B.: Robust diffusion-based motion in-betweening. Computer Graphics Forum (2024).https://doi.org/10.1111/cgf.15260

work page doi:10.1111/cgf.15260 2024
[36]

ACM Transactions on Graphics41(6) (December 2022)

Qin, J., Zheng, Y., Zhou, K.: Motion in-betweening via two-stage transformers. ACM Transactions on Graphics41(6) (December 2022). https://doi.org/10. 1145/3550454.3555454

arXiv 2022
[37]

In: Meila, M., Zhang, T

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Resea...

2021
[38]

DreamGaussian4D: Generative 4D gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023).https: //doi.org/10.48550/arXiv.2312.17142,https://arxiv.org/abs/2312.17142

work page doi:10.48550/arxiv.2312.17142 2023
[39]

In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https : / / proceedings

Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Tor- ralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruc- tion model. In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https : / / proceedings . neurips . cc / paper _ files / paper / 2024 / hash / 6808f2c57d9564a2639a4710e3bbd9b9-Abst...

2024
[40]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4D dynamic scene generation. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine L...

2023
[41]

In: Proceedings of the 40th International Conference on Machine Learn- ing

Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4d dynamic scene generation. In: Proceedings of the 40th International Conference on Machine Learn- ing. Proceedings of Machine Learning Research, vol. 202, pp. 31915–31929. PMLR (Jul 2023),https://proceedings.m...

2023
[42]

48550/arXiv.2402.05054,https://arxiv.org/abs/2402.05054

Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3d generation (2024).https://doi.org/10. 48550/arXiv.2402.05054,https://arxiv.org/abs/2402.05054

arXiv 2024
[43]

Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model (2022).https://doi.org/10.48550/arXiv.2209.14916 , https://arxiv.org/abs/2209.14916

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.14916 2022
[44]

In: International Conference on Learning Representations (ICLR) (2025),https://openreview.net/forum?id=ykD8a9gJvy

Wang, X., Zhou, B., Curless, B., Kemelmacher-Shlizerman, I., Holynski, A., Seitz, S.M.: Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In: International Conference on Learning Representations (ICLR) (2025),https://openreview.net/forum?id=ykD8a9gJvy

2025
[45]

Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian sur- fels.arXiv preprint arXiv:2405.16822, 2024

Wang, Y., Wang, X., Chen, Z., Wang, Z., Sun, F., Zhu, J.: Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian surfels (2024).https: //doi.org/10.48550/arXiv.2405.16822,https://arxiv.org/abs/2405.16822 30 H. Nishizawa et al

work page doi:10.48550/arxiv.2405.16822 2024
[46]

arXiv preprint arXiv:2602.22742 (2026)

Watanabe, A., Yu, Q., Simo-Serra, E., Fujiwara, K.: Projflow: Projection sampling with flow matching for zero-shot exact spatial motion control. arXiv preprint arXiv:2602.22742 (2026)

arXiv 2026
[47]

Watson, D., Saxena, S., Li, L., Tagliasacchi, A., Fleet, D.J.: Controlling space and time with diffusion models (2024).https://doi.org/10.48550/arXiv.2407.07860, https://arxiv.org/abs/2407.07860

work page doi:10.48550/arxiv.2407.07860 2024
[48]

In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

Wei, D., Sun, X., Sun, H., Li, B., Hu, S., Li, W., Lu, J.: Diffkfc: Keyframes- collaborated diffusion for human motion synthesis with text control. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

2024
[49]

Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024).https: //doi.org/10.48550/arXiv.2310.08528,https://arxiv.org/abs/2310.08528

work page doi:10.48550/arxiv.2310.08528 2024
[50]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4d: Create anything in 4d with multi-view video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (June 2025),https://openaccess.thecvf. com/content/CVPR2025/html/Wu_CAT4D_Create_Anything_in_4D...

2025
[51]

In: Computer Vision – ECCV 2024

Wu, Z., Yu, C., Jiang, Y., Cao, C., Wang, F., Bai, X.: Sc4d: Sparse-controlled video-to-4d generation and motion transfer. In: Computer Vision – ECCV 2024. pp. 361–379 (2024).https://doi.org/10.1007/978-3-031-72624-8_21

work page doi:10.1007/978-3-031-72624-8_21 2024
[52]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Wu, Z., Yu, C., Wang, F., Bai, X.: Animateanymesh: A feed-forward 4d foundation model for text-driven universal mesh animation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13557–13568 (October 2025)

2025
[53]

In: International Conference on Learning Representations (ICLR) (2024),https://neu-vi.github.io/omnicontrol/

Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: Omnicontrol: Control any joint at any time for human motion generation. In: International Conference on Learning Representations (ICLR) (2024),https://neu-vi.github.io/omnicontrol/

2024
[54]

arXiv preprint arXiv:2312.17225 (2023)

Yin, Y., Xu, D., Wang, Z., Zhao, Y., Wei, Y.: 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225 (2023)

arXiv 2023
[55]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Yun, K., Hong, S., Kim, C., Noh, J.: Anymole: Any character motion in-betweening leveraging video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 27838–27848 (June 2025), https://openaccess.thecvf.com/content/CVPR2025/html/Yun_AnyMoLe_ Any_Character_Motion_In-betweening_Leveraging_Video_D...

2025
[56]

In: Computer Vision – ECCV 2024

Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15094, pp. 163–179. Springer (2024). https://doi.org/10.1007/978-3-031-72764-1_10 , https://www.ecva. net/papers/eccv_2024/papers_ECCV/html/528...

work page doi:10.1007/978-3-031-72764-1_10 2024
[57]

Continual reinforcement learning with multi-timescale replay (2020)

Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view video diffusion model for 4d generation (2024).https://doi.org/10.48550/arXiv. 2405.20674,https://arxiv.org/abs/2405.20674

work page internal anchor Pith review doi:10.48550/arxiv 2024
[58]

Advances in Neural Information Processing Systems38, 30546–30566 (2026)

Zhang, L., Cai, S., Li, M., Wetzstein, G., Agrawala, M.: Frame context packing and drift prevention in next-frame-prediction video diffusion models. Advances in Neural Information Processing Systems38, 30546–30566 (2026)

2026
[59]

org/10.48550/arXiv.2208.15001,https://arxiv.org/abs/2208.15001 Feed-forward Motion In-betweening for Any 4D 31

Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model (2022).https://doi. org/10.48550/arXiv.2208.15001,https://arxiv.org/abs/2208.15001 Feed-forward Motion In-betweening for Any 4D 31

work page doi:10.48550/arxiv.2208.15001 2022
[60]

arXiv preprint arXiv:2311.14603 (2023)

Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)

arXiv 2023
[61]

In: ACM SIGGRAPH Conference Proceedings (2025).https://doi.org/ 10.1145/3721238.3730664, https://dl.acm.org/doi/10.1145/3721238.3730664

Zheng, B., Chen, K., Yao, Y., Zeng, Z., Jiang, X., Wang, H., Lasenby, J., Jin, X.: Autokeyframe: Autoregressive keyframe generation for human motion synthesis and editing. In: ACM SIGGRAPH Conference Proceedings (2025).https://doi.org/ 10.1145/3721238.3730664, https://dl.acm.org/doi/10.1145/3721238.3730664

work page doi:10.1145/3721238.3730664 2025
[62]

arXiv preprint arXiv:2501.01722 (2025).https: //doi.org/10.48550/arXiv.2501.01722,https://arxiv.org/abs/2501.01722

Zhu, H., He, T., Yu, X., Guo, J., Chen, Z., Bian, J.: Ar4d: Autoregressive 4d generation from monocular videos. arXiv preprint arXiv:2501.01722 (2025).https: //doi.org/10.48550/arXiv.2501.01722,https://arxiv.org/abs/2501.01722

work page doi:10.48550/arxiv.2501.01722 2025

[1] [1]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Bae, H., Oh, G., Lee, K., Kim, S., Shin, H., Lee, K.M.: Less is more: Improving motion diffusion models with sparse keyframes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7902–7913 (October 2025)

2025

[2] [2]

In: Computer Vision – ECCV 2024

Bahmani, S., Liu, X., Yifan, W., Skorokhodov, I., Rong, V., Liu, Z., Liu, X., Park, J.J., Tulyakov, S., Wetzstein, G., Tagliasacchi, A., Lindell, D.B.: TC4D: Trajectory-conditioned text-to-4d generation. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15104, pp. 53–72. Springer (2024).https: //doi.org/10.1007/978-3-031-72952-2_4

work page doi:10.1007/978-3-031-72952-2_4 2024

[3] [3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4d-fy: Text-to-4d gener- ation using hybrid score distillation sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7996– 8006 (June 2024), https://openaccess.thecvf...

2024

[4] [5]

Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., VanderBilt, E., Kembhavi, A., Vondrick, C., Gkioxari, G., Ehsani, K., Schmidt, L., Farhadi, A.: Objaverse-xl: A universe of 10m+ 3d objects (2023).https://doi.org/10.48550/arXiv.2307.05663, https: //arxiv.org/abs/2307.05663

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.05663 2023

[5] [6]

InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects (2022). https://doi.org/10.48550/arXiv.2212.08051 , https://arxiv.org/abs/2212.08051

work page doi:10.48550/arxiv.2212.08051 2022

[6] [7]

arXiv preprint arXiv:2403.03206 (2024)

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., Lacey, K., Goodwin, A., Marek, Y., Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)

Pith/arXiv arXiv 2024

[7] [8]

Geng, Z., Han, C., Hayder, Z., Liu, J., Shah, M., Mian, A.: Text-guided 3d human motion generation with keyframe-based parallel skip transformer (2024).https: //doi.org/10.48550/arXiv.2405.15439,https://arxiv.org/abs/2405.15439

work page doi:10.48550/arxiv.2405.15439 2024

[8] [9]

In: ACM SIGGRAPH Asia Conference Papers (2025)

Goel, P., Tevet, G., Liu, C.K., Fatahalian, K.: Generating detailed character mo- tion from blocking poses. In: ACM SIGGRAPH Asia Conference Papers (2025). https://doi.org/10.1145/3757377.3763874 , https://dl.acm.org/doi/10. 1145/3757377.3763874

work page doi:10.1145/3757377.3763874 2025

[9] [10]

Computer Graphics Forum (2025).https://doi.org/ 10.1111/cgf.70060

Goel, P., Zhang, H., Liu, C.K., Fatahalian, K.: Generative motion infilling from imprecisely timed keyframes. Computer Graphics Forum (2025).https://doi.org/ 10.1111/cgf.70060

work page doi:10.1111/cgf.70060 2025

[10] [11]

00063,https://arxiv.org/abs/2312.00063

Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: Momask: Generative masked modeling of 3d human motions (2023).https://doi.org/10.48550/arXiv.2312. 00063,https://arxiv.org/abs/2312.00063

work page doi:10.48550/arxiv.2312 2023

[11] [12]

ACM Transactions on Graphics39(4) (July 2020).https://doi.org/10.1145/ 3386569.3392480

Harvey, F.G., Yurick, M., Nowrouzezahrai, D., Pal, C.: Robust motion in-betweening. ACM Transactions on Graphics39(4) (July 2020).https://doi.org/10.1145/ 3386569.3392480

arXiv 2020

[12] [13]

Flexible diffusion model- ing of long videos.arXiv preprint arXiv:2205.11495, 2022

Harvey, W., Naderiparizi, S., Masrani, V., Weilbach, C., Wood, F.: Flexible diffusion modeling of long videos. In: Advances in Neural Information Processing Systems (2022). https://doi.org/10.48550/arXiv.2205.11495, https://arxiv.org/abs/ 2205.11495

work page doi:10.48550/arxiv.2205.11495 2022

[13] [14]

Yenpure, S

Hong, S., Kim, H., Cho, K., Noh, J.: Long-term motion in-betweening via keyframe prediction. Computer Graphics Forum (2024).https://doi.org/10.1111/cgf. 15171

work page doi:10.1111/cgf 2024

[14] [15]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21807–21818 (June 2024),https://o...

2024

[15] [16]

Nishizawa et al

Jiang, Y., Yu, C., Cao, C., Wang, F., Hu, W., Gao, J.: Animate3d: Animating any 3d model with multi-view video diffusion (2024).https://doi.org/10.48550/ arXiv.2406.11216,https://arxiv.org/abs/2406.11216 28 H. Nishizawa et al

arXiv 2024

[16] [17]

In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum? id=sPUrdFGepF

Jiang, Y., Zhang, L., Gao, J., Hu, W., Yao, Y.: Consistent4d: Consistent 360° dynamic object generation from monocular video. In: The Twelfth International Conference on Learning Representations (2024),https://openreview.net/forum? id=sPUrdFGepF

2024

[17] [18]

48550/arXiv.2305.12577,https://arxiv.org/abs/2305.12577

Karunratanakul, K., Preechakul, K., Suwajanakorn, S., Tang, S.: Guided motion diffusion for controllable human motion synthesis (2023).https://doi.org/10. 48550/arXiv.2305.12577,https://arxiv.org/abs/2305.12577

arXiv 2023

[18] [19]

In: International Conference on 3D Vision (3DV)

Kaufmann, M., Aksan, E., Song, J., Pece, F., Ziegler, R., Hilliges, O.: Convolutional autoencoders for human motion infilling. In: International Conference on 3D Vision (3DV). pp. 918–927 (2020).https://doi.org/10.48550/arXiv.2010.11531

work page doi:10.48550/arxiv.2010.11531 2020

[19] [20]

Peters and S

Kim, J., Byun, T., Shin, S., Won, J., Choi, S.: Conditional motion in-betweening. Pattern Recognition132, 108894 (December 2022).https://doi.org/10.1016/j. patcog.2022.108894

work page doi:10.1016/j 2022

[20] [21]

In: International Conference on Learning Representations (ICLR) (2015),https://arxiv.org/abs/ 1412.6980

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015),https://arxiv.org/abs/ 1412.6980

Pith/arXiv arXiv 2015

[21] [22]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Li, Y., Takehara, H., Taketomi, T., Zheng, B., Nießner, M.: 4dcomplete: Non-rigid motion estimation beyond the observable surface. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12706–12716 (2021), https://openaccess.thecvf.com/content/ICCV2021/html/Li_4DComplete_Non- Rigid_Motion_Estimation_Beyond_the_Observable_S...

2021

[22] [23]

arXiv preprint arXiv:2512.14284 (2025)

Li, Z., Zhang, M., Wu, T., Tan, J., Wang, J., Lin, D.: Ss4d: Native 4d generative model via structured spacetime latents. arXiv preprint arXiv:2512.14284 (2025)

arXiv 2025

[23] [24]

In: Advances in Neural Information Processing Systems (2024)

Li, Z., Chen, Y., Liu, P.: Dreammesh4d: Video-to-4d generation with sparse- controlled gaussian-mesh hybrid representation. In: Advances in Neural Information Processing Systems (2024)

2024

[24] [25]

arXiv preprint arXiv:2312.13763 (2023)

Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text- to-4d with dynamic 3d gaussians and composed diffusion models. arXiv preprint arXiv:2312.13763 (2023)

arXiv 2023

[25] [26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text- to-4d with dynamic 3d gaussians and composed diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8576–8588 (June 2024),https://openaccess.thecvf.com/content/CVPR2024/ html/Ling_Align_Your_Gaussians_Text- to-...

2024

[26] [27]

arXiv preprint arXiv:2209.03003 (2022)

Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

Pith/arXiv arXiv 2022

[27] [28]

Choreographing a world of dynamic objects, 2026

Lyu, Y., Geng, C., Dharmarajan, K., Zhang, Y., Alzayer, H., Wu, S., Wu, J.: Choreographing a world of dynamic objects. arXiv preprint arXiv:2601.04194 (2026). https://doi.org/10.48550/arXiv.2601.04194 , https://arxiv.org/abs/2601. 04194

work page doi:10.48550/arxiv.2601.04194 2026

[28] [29]

Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: Amass: Archive of motion capture as surface shapes (2019).https://doi.org/10.48550/ arXiv.1904.03278,https://arxiv.org/abs/1904.03278

Pith/arXiv arXiv 2019

[29] [30]

arXiv preprint arXiv:2504.08366 (2025)

Nag, S., Cohen-Or, D., Zhang, H., Mahdavi-Amiri, A.: In-2-4d: Inbetweening from two single-view images to 4d generation. arXiv preprint arXiv:2504.08366 (2025). https://doi.org/10.48550/arXiv.2504.08366

work page doi:10.48550/arxiv.2504.08366 2025

[30] [31]

arXiv preprint arXiv:2201.06701 (2022).https://doi.org/10.48550/arXiv.2201.06701 Feed-forward Motion In-betweening for Any 4D 29

Oreshkin, B.N., Valkanas, A., Harvey, F.G., Ménard, L.S., Bocquelet, F., Coates, M.J.: Motion inbetweening via deepδ-interpolator. arXiv preprint arXiv:2201.06701 (2022).https://doi.org/10.48550/arXiv.2201.06701 Feed-forward Motion In-betweening for Any 4D 29

work page doi:10.48550/arxiv.2201.06701 2022

[31] [32]

arXiv preprint arXiv:2401.08742 (2024)

Pan, Z., Yang, Z., Zhu, X., Zhang, L.: Fast dynamic 3d object generation from a single-view video. arXiv preprint arXiv:2401.08742 (2024)

arXiv 2024

[32] [33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 4195– 4205 (October 2023)

2023

[33] [34]

arXiv preprint arXiv:2504.09413 (2025).https://doi.org/10.48550/ arXiv.2504.09413

Qin, J.: Scalable motion in-betweening via diffusion and physics-based character adaptation. arXiv preprint arXiv:2504.09413 (2025).https://doi.org/10.48550/ arXiv.2504.09413

arXiv 2025

[34] [35]

Computer Graphics Forum (2024).https://doi.org/10.1111/cgf.15260

Qin, J., Yan, P., An, B.: Robust diffusion-based motion in-betweening. Computer Graphics Forum (2024).https://doi.org/10.1111/cgf.15260

work page doi:10.1111/cgf.15260 2024

[35] [36]

ACM Transactions on Graphics41(6) (December 2022)

Qin, J., Zheng, Y., Zhou, K.: Motion in-betweening via two-stage transformers. ACM Transactions on Graphics41(6) (December 2022). https://doi.org/10. 1145/3550454.3555454

arXiv 2022

[36] [37]

In: Meila, M., Zhang, T

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Resea...

2021

[37] [38]

DreamGaussian4D: Generative 4D gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023).https: //doi.org/10.48550/arXiv.2312.17142,https://arxiv.org/abs/2312.17142

work page doi:10.48550/arxiv.2312.17142 2023

[38] [39]

In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https : / / proceedings

Ren, J., Xie, K., Mirzaei, A., Liang, H., Zeng, X., Kreis, K., Liu, Z., Tor- ralba, A., Fidler, S., Kim, S.W., Ling, H.: L4gm: Large 4d gaussian reconstruc- tion model. In: Advances in Neural Information Processing Systems (NeurIPS) (2024), https : / / proceedings . neurips . cc / paper _ files / paper / 2024 / hash / 6808f2c57d9564a2639a4710e3bbd9b9-Abst...

2024

[39] [40]

In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J

Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4D dynamic scene generation. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine L...

2023

[40] [41]

In: Proceedings of the 40th International Conference on Machine Learn- ing

Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y.: Text-to-4d dynamic scene generation. In: Proceedings of the 40th International Conference on Machine Learn- ing. Proceedings of Machine Learning Research, vol. 202, pp. 31915–31929. PMLR (Jul 2023),https://proceedings.m...

2023

[41] [42]

48550/arXiv.2402.05054,https://arxiv.org/abs/2402.05054

Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3d generation (2024).https://doi.org/10. 48550/arXiv.2402.05054,https://arxiv.org/abs/2402.05054

arXiv 2024

[42] [43]

Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model (2022).https://doi.org/10.48550/arXiv.2209.14916 , https://arxiv.org/abs/2209.14916

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.14916 2022

[43] [44]

In: International Conference on Learning Representations (ICLR) (2025),https://openreview.net/forum?id=ykD8a9gJvy

Wang, X., Zhou, B., Curless, B., Kemelmacher-Shlizerman, I., Holynski, A., Seitz, S.M.: Generative inbetweening: Adapting image-to-video models for keyframe interpolation. In: International Conference on Learning Representations (ICLR) (2025),https://openreview.net/forum?id=ykD8a9gJvy

2025

[44] [45]

Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian sur- fels.arXiv preprint arXiv:2405.16822, 2024

Wang, Y., Wang, X., Chen, Z., Wang, Z., Sun, F., Zhu, J.: Vidu4d: Single generated video to high-fidelity 4d reconstruction with dynamic gaussian surfels (2024).https: //doi.org/10.48550/arXiv.2405.16822,https://arxiv.org/abs/2405.16822 30 H. Nishizawa et al

work page doi:10.48550/arxiv.2405.16822 2024

[45] [46]

arXiv preprint arXiv:2602.22742 (2026)

Watanabe, A., Yu, Q., Simo-Serra, E., Fujiwara, K.: Projflow: Projection sampling with flow matching for zero-shot exact spatial motion control. arXiv preprint arXiv:2602.22742 (2026)

arXiv 2026

[46] [47]

Watson, D., Saxena, S., Li, L., Tagliasacchi, A., Fleet, D.J.: Controlling space and time with diffusion models (2024).https://doi.org/10.48550/arXiv.2407.07860, https://arxiv.org/abs/2407.07860

work page doi:10.48550/arxiv.2407.07860 2024

[47] [48]

In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

Wei, D., Sun, X., Sun, H., Li, B., Hu, S., Li, W., Lu, J.: Diffkfc: Keyframes- collaborated diffusion for human motion synthesis with text control. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

2024

[48] [49]

Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering (2024).https: //doi.org/10.48550/arXiv.2310.08528,https://arxiv.org/abs/2310.08528

work page doi:10.48550/arxiv.2310.08528 2024

[49] [50]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4d: Create anything in 4d with multi-view video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 26057–26068 (June 2025),https://openaccess.thecvf. com/content/CVPR2025/html/Wu_CAT4D_Create_Anything_in_4D...

2025

[50] [51]

In: Computer Vision – ECCV 2024

Wu, Z., Yu, C., Jiang, Y., Cao, C., Wang, F., Bai, X.: Sc4d: Sparse-controlled video-to-4d generation and motion transfer. In: Computer Vision – ECCV 2024. pp. 361–379 (2024).https://doi.org/10.1007/978-3-031-72624-8_21

work page doi:10.1007/978-3-031-72624-8_21 2024

[51] [52]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Wu, Z., Yu, C., Wang, F., Bai, X.: Animateanymesh: A feed-forward 4d foundation model for text-driven universal mesh animation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13557–13568 (October 2025)

2025

[52] [53]

In: International Conference on Learning Representations (ICLR) (2024),https://neu-vi.github.io/omnicontrol/

Xie, Y., Jampani, V., Zhong, L., Sun, D., Jiang, H.: Omnicontrol: Control any joint at any time for human motion generation. In: International Conference on Learning Representations (ICLR) (2024),https://neu-vi.github.io/omnicontrol/

2024

[53] [54]

arXiv preprint arXiv:2312.17225 (2023)

Yin, Y., Xu, D., Wang, Z., Zhao, Y., Wei, Y.: 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225 (2023)

arXiv 2023

[54] [55]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Yun, K., Hong, S., Kim, C., Noh, J.: Anymole: Any character motion in-betweening leveraging video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 27838–27848 (June 2025), https://openaccess.thecvf.com/content/CVPR2025/html/Yun_AnyMoLe_ Any_Character_Motion_In-betweening_Leveraging_Video_D...

2025

[55] [56]

In: Computer Vision – ECCV 2024

Zeng, Y., Jiang, Y., Zhu, S., Lu, Y., Lin, Y., Zhu, H., Hu, W., Cao, X., Yao, Y.: Stag4d: Spatial-temporal anchored generative 4d gaussians. In: Computer Vision – ECCV 2024. Lecture Notes in Computer Science, vol. 15094, pp. 163–179. Springer (2024). https://doi.org/10.1007/978-3-031-72764-1_10 , https://www.ecva. net/papers/eccv_2024/papers_ECCV/html/528...

work page doi:10.1007/978-3-031-72764-1_10 2024

[56] [57]

Continual reinforcement learning with multi-timescale replay (2020)

Zhang, H., Chen, X., Wang, Y., Liu, X., Wang, Y., Qiao, Y.: 4diffusion: Multi-view video diffusion model for 4d generation (2024).https://doi.org/10.48550/arXiv. 2405.20674,https://arxiv.org/abs/2405.20674

work page internal anchor Pith review doi:10.48550/arxiv 2024

[57] [58]

Advances in Neural Information Processing Systems38, 30546–30566 (2026)

Zhang, L., Cai, S., Li, M., Wetzstein, G., Agrawala, M.: Frame context packing and drift prevention in next-frame-prediction video diffusion models. Advances in Neural Information Processing Systems38, 30546–30566 (2026)

2026

[58] [59]

org/10.48550/arXiv.2208.15001,https://arxiv.org/abs/2208.15001 Feed-forward Motion In-betweening for Any 4D 31

Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: Motiondiffuse: Text-driven human motion generation with diffusion model (2022).https://doi. org/10.48550/arXiv.2208.15001,https://arxiv.org/abs/2208.15001 Feed-forward Motion In-betweening for Any 4D 31

work page doi:10.48550/arxiv.2208.15001 2022

[59] [60]

arXiv preprint arXiv:2311.14603 (2023)

Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)

arXiv 2023

[60] [61]

In: ACM SIGGRAPH Conference Proceedings (2025).https://doi.org/ 10.1145/3721238.3730664, https://dl.acm.org/doi/10.1145/3721238.3730664

Zheng, B., Chen, K., Yao, Y., Zeng, Z., Jiang, X., Wang, H., Lasenby, J., Jin, X.: Autokeyframe: Autoregressive keyframe generation for human motion synthesis and editing. In: ACM SIGGRAPH Conference Proceedings (2025).https://doi.org/ 10.1145/3721238.3730664, https://dl.acm.org/doi/10.1145/3721238.3730664

work page doi:10.1145/3721238.3730664 2025

[61] [62]

arXiv preprint arXiv:2501.01722 (2025).https: //doi.org/10.48550/arXiv.2501.01722,https://arxiv.org/abs/2501.01722

Zhu, H., He, T., Yu, X., Guo, J., Chen, Z., Bian, J.: Ar4d: Autoregressive 4d generation from monocular videos. arXiv preprint arXiv:2501.01722 (2025).https: //doi.org/10.48550/arXiv.2501.01722,https://arxiv.org/abs/2501.01722

work page doi:10.48550/arxiv.2501.01722 2025