Image-to-Video Diffusion: From Foundations to Open Frontiers

Hangtao Zhang; Ke Li; Leo Yu Zhang; Shijia Zhou; Wenbo Pan; Xianlong Wang; Xiaohua Jia; Yuqi Wang; Zeyu Ye

arxiv: 2605.17248 · v1 · pith:KNZVHSFXnew · submitted 2026-05-17 · 💻 cs.CV

Image-to-Video Diffusion: From Foundations to Open Frontiers

Xianlong Wang , Wenbo Pan , Shijia Zhou , Ke Li , Yuqi Wang , Zeyu Ye , Hangtao Zhang , Leo Yu Zhang

show 1 more author

Xiaohua Jia

This is my paper

Pith reviewed 2026-05-20 15:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords image-to-video generationdiffusion modelsgenerative AIvideo synthesismodel taxonomytemporal modelingcondition encodingnoise prior

0 comments

The pith

Image-to-video diffusion reduces to four core designs identified through a dedicated taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats diffusion-based image-to-video generation as a distinct topic rather than part of general video synthesis. It reviews task setup, architectures, data, and metrics before grouping methods into categories based on how the models are structured and trained. This grouping leads to the extraction of four recurring design choices that shape performance: how conditions from the input image are incorporated, how temporal information across frames is handled, what noise patterns are used to start generation, and how spatial and temporal resolution are increased. The analysis also covers real-world applications and lists remaining technical hurdles.

Core claim

By creating a taxonomy organized by architecture and training paradigm, the work shows that image-to-video diffusion models share four key design elements—condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling—which together address the requirements for content consistency, identity preservation, and motion coherence.

What carries the argument

Taxonomy of methods based on architecture and training paradigm, which distills four core designs: condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling.

If this is right

Future models can be designed by selecting and combining specific implementations for each of the four core designs.
Comparisons between methods become more straightforward when evaluated against variations in condition encoding or temporal modeling.
Datasets and metrics can be refined to better test the effectiveness of noise prior choices and upsampling strategies.
Applications in content creation gain from targeted improvements in temporal coherence and identity preservation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar taxonomies could be applied to text-to-video or other conditional generation tasks to find overlapping design principles.
Empirical studies that vary one core design while holding others fixed could quantify the contribution of each to overall video quality.
Integration with efficiency considerations might identify design choices that maintain quality at lower computational cost.

Load-bearing premise

That a focused taxonomy and analysis for image-to-video diffusion alone will yield clearer organization and insights than embedding it in wider video generation discussions.

What would settle it

Finding that most recent high-performing I2V models do not rely on the proposed four core designs or fit outside the taxonomy would indicate the classification is incomplete.

Figures

Figures reproduced from arXiv: 2605.17248 by Hangtao Zhang, Ke Li, Leo Yu Zhang, Shijia Zhou, Wenbo Pan, Xianlong Wang, Xiaohua Jia, Yuqi Wang, Zeyu Ye.

**Figure 1.** Figure 1: Basic architecture of image-to-video (I2V) diffusion models. Abstract—Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion cohere… view at source ↗

**Figure 2.** Figure 2: Video effects of I2V models under different conditions. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of model architectures of diffusion I2V. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: A taxonomy of diffusion I2V schemes, organized by model architecture and then subdivided by training paradigm. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Illustration of four key technologies in I2V diffusion process. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Four open problems and future research directions in diffusion I2V. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

Diffusion-based \textit{image-to-video} (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey organizes I2V diffusion work around four core designs and gives a clear map of the area, but its main value depends on whether the claimed gap in dedicated taxonomies actually holds up against recent video generation surveys.

read the letter

The key takeaway is that this paper offers a structured taxonomy for diffusion image-to-video methods, breaking them down into four main design elements: condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling. It also covers the basics like task formulation, architectures, datasets, and evaluation. What the paper does well is pulling together a lot of recent work into one place with a consistent lens based on architecture and training paradigm. This can help researchers see patterns across papers that might otherwise look scattered. The discussion of applications and challenges at the end points to where the field might go next. On the soft spots, the central claim is that no prior work has done a dedicated taxonomy for I2V specifically. That might be true, but it would be good to double-check against the latest surveys on video diffusion models to make sure this isn't overlapping too much with existing overviews. If those already have solid I2V sections, the contribution shrinks to reorganization. Since this is a survey without new experiments or proofs, the strength rests entirely on how complete and fair the coverage is. The abstract mentions a systematic review, which is a start, but the actual paper needs to show it didn't cherry-pick. This kind of paper is mainly for newcomers to the subfield or for people who want a quick reference on the main ideas. It could be useful in a reading group if the group is focused on generative models. I would recommend sending it to peer review. Surveys that organize fast-growing areas like this one can be valuable if they hold up under scrutiny on the gap they claim to fill.

Referee Report

2 major / 2 minor

Summary. The paper presents a survey on diffusion-based image-to-video (I2V) generation, claiming to address a gap where prior works treat I2V only within broader video generation topics. It reviews task formulation, model architectures, datasets, and evaluation metrics; organizes methods via a taxonomy based on architecture and training paradigm; distills four core designs (condition encoding, temporal modeling, noise prior design, spatial-temporal upsampling); and discusses applications and open challenges.

Significance. If the taxonomy is shown to be exhaustive and the gap claim is substantiated by direct comparison to existing video diffusion surveys, the work could provide a useful organizing framework for a fast-moving subfield, highlighting design choices and challenges that are specific to the stricter consistency requirements of I2V.

major comments (2)

[Introduction] Introduction: The assertion that 'existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy' is load-bearing for the central contribution; the manuscript should include an explicit comparison (e.g., a table or dedicated subsection) to recent video diffusion surveys to demonstrate that their I2V coverage is not already comparable in organization or depth.
[Taxonomy] Taxonomy section: The taxonomy organized by architecture and training paradigm must be justified as exhaustive; without a clear accounting of how all representative methods (including those published after the survey cutoff) map onto the categories, the claim of a systematic analysis centered on I2V risks appearing selective.

minor comments (2)

[Datasets and Evaluation Metrics] Datasets and metrics sections: Provide more detail on how the listed datasets and metrics specifically capture identity preservation and motion coherence, which the abstract identifies as stricter demands for I2V.
[Core Designs] Four core designs: When distilling condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, include a brief cross-reference table showing which methods exemplify each design to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our survey. These suggestions will help strengthen the justification for our contribution and the systematic nature of the taxonomy. We respond to each major comment below.

read point-by-point responses

Referee: [Introduction] Introduction: The assertion that 'existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy' is load-bearing for the central contribution; the manuscript should include an explicit comparison (e.g., a table or dedicated subsection) to recent video diffusion surveys to demonstrate that their I2V coverage is not already comparable in organization or depth.

Authors: We agree that an explicit comparison strengthens the central claim. In the revised manuscript we will add a dedicated subsection (and accompanying table) in the Introduction that directly compares our survey against recent video diffusion surveys. The table will summarize differences in scope, depth of I2V-specific analysis (condition encoding, temporal modeling, noise prior, and upsampling), and organizational structure, thereby substantiating that prior works treat I2V only peripherally and lack a dedicated taxonomy. revision: yes
Referee: [Taxonomy] Taxonomy section: The taxonomy organized by architecture and training paradigm must be justified as exhaustive; without a clear accounting of how all representative methods (including those published after the survey cutoff) map onto the categories, the claim of a systematic analysis centered on I2V risks appearing selective.

Authors: We acknowledge the value of demonstrating coverage. The taxonomy is organized along two axes—architecture (convolutional versus transformer-based backbones) and training paradigm (from-scratch training versus adaptation/fine-tuning)—because these axes capture the principal design decisions that differentiate I2V diffusion methods. In revision we will add an explicit justification paragraph and a mapping table that places representative methods into the categories. For works published after the survey cutoff we will insert a brief discussion noting that emerging methods continue to align with the same architectural and training paradigms; we will also state our intention to incorporate post-cutoff papers in future updates. This approach preserves the systematic character of the analysis while remaining transparent about temporal scope. revision: partial

Circularity Check

0 steps flagged

No significant circularity in external literature synthesis

full rationale

The paper is a survey synthesizing task formulations, architectures, datasets, metrics, and methods from external literature into a taxonomy organized by architecture/training paradigm and four distilled core designs. No derivations, equations, parameter fittings, or predictions appear that could reduce to the paper's own inputs by construction. The gap assertion (existing works lack a dedicated I2V-centered taxonomy) is an external claim about prior surveys, verifiable independently and not self-referential or load-bearing via self-citation chains. The work remains self-contained as an organizational review without any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that does not introduce new free parameters, axioms, or invented entities; it reviews existing literature on diffusion-based image-to-video generation.

pith-pipeline@v0.9.0 · 5706 in / 1061 out tokens · 65238 ms · 2026-05-20T15:01:07.978414+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

190 extracted references · 190 canonical work pages · 26 internal anchors

[1]

Dynamic-i2v: Exploring image-to-video generation models via multimodal llm,

P. Liu, X. Ren, F. Liu, Q. Xie, Q. Zheng, Y . Zhang, H. Lu, and Y . Yang, “Dynamic-i2v: Exploring image-to-video generation models via multimodal llm,”arXiv preprint arXiv:2505.19901, 2025

work page arXiv 2025
[2]

Levitor: 3d trajectory oriented image-to-video synthesis,

H. Wang, H. Ouyang, Q. Wang, W. Wang, K. L. Cheng, Q. Chen, Y . Shen, and L. Wang, “Levitor: 3d trajectory oriented image-to-video synthesis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 12 490–12 500

work page 2025
[3]

Motioncanvas: Cinematic shot design with controllable image-to-video generation,

J. Xing, L. Mai, C. Ham, J. Huang, A. Mahapatra, C.-W. Fu, T.- T. Wong, and F. Liu, “Motioncanvas: Cinematic shot design with controllable image-to-video generation,” inProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference, 2025, pp. 1–11

work page 2025
[4]

Versatile transi- tion generation with image-to-video diffusion,

Z. Yang, J. Zhang, Y . Yu, S. Lu, and S. Bai, “Versatile transi- tion generation with image-to-video diffusion,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16 981–16 990

work page 2025
[5]

I2v3d: Controllable image-to-video generation with 3d guidance,

Z. Zhang, D. Chen, and J. Liao, “I2v3d: Controllable image-to-video generation with 3d guidance,”arXiv preprint arXiv:2503.09733, 2025

work page arXiv 2025
[6]

Realcam-i2v: Real-world image-to-video generation with interactive complex camera control,

T. Li, G. Zheng, R. Jiang, S. Zhan, T. Wu, Y . Lu, Y . Lin, C. Deng, Y . Xiong, M. Chen, L. Cheng, and X. Li, “Realcam-i2v: Real-world image-to-video generation with interactive complex camera control,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 28 785–28 796

work page 2025
[7]

Extrapolating and decoupling image-to-video generation models: Motion modeling is easier than you think,

J. Tian, X. Qu, Z. Lu, W. Wei, S. Liu, and Y . Cheng, “Extrapolating and decoupling image-to-video generation models: Motion modeling is easier than you think,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 12 512–12 521

work page 2025
[8]

Tagrpo: Boosting grpo on image-to-video generation with direct trajectory alignment,

J. Wang, J. Lu, G. Xu, C. Chen, H. Yang, L. Wang, P. Chen, M. Chen, Z. Hu, L. Wu, S. Shao, Q. Lu, and P. Luo, “Tagrpo: Boosting grpo on image-to-video generation with direct trajectory alignment,”arXiv preprint arXiv:2601.05729, 2026

work page arXiv 2026
[9]

Auto-Encoding Variational Bayes

D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[10]

I4vgen: Image as free stepping stone for text-to-video generation,

X. Guo, J. Liu, M. Cui, L. Bo, and D. Huang, “I4vgen: Image as free stepping stone for text-to-video generation,”arXiv preprint arXiv:2406.02230, 2024

work page arXiv 2024
[11]

Emu video: Factorizing text-to-video generation by explicit image conditioning,

R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra, “Emu video: Factorizing text-to-video generation by explicit image conditioning,”arXiv preprint arXiv:2311.10709, 2023

work page arXiv 2023
[12]

Rv-gan: Recurrent gan for uncondi- tional video generation,

S. Gupta, A. Keshari, and S. Das, “Rv-gan: Recurrent gan for uncondi- tional video generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2024–2033. 13

work page 2022
[13]

Styleinv: A temporal style modulated inversion network for unconditional video generation,

Y . Wang, L. Jiang, and C. C. Loy, “Styleinv: A temporal style modulated inversion network for unconditional video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 851–22 861

work page 2023
[14]

Realisdance: Equip controllable character anima- tion with realistic hands.arXiv preprint arXiv:2409.06202,

J. Zhou, B. Wang, W. Chen, J. Bai, D. Li, A. Zhang, H. Xu, M. Yang, and F. Wang, “Realisdance: Equip controllable character animation with realistic hands,”arXiv preprint arXiv:2409.06202, 2024

work page arXiv 2024
[15]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation,

L. Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8153–8163

work page 2024
[16]

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

D. Xu, W. Nie, C. Liu, S. Liu, J. Kautz, Z. Wang, and A. Vahdat, “Camco: Camera-controllable 3d-consistent image-to-video genera- tion,”arXiv preprint arXiv:2406.02509, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,

M. Niu, X. Cun, X. Wang, Y . Zhang, Y . Shan, and Y . Zheng, “Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 111–128

work page 2024
[18]

Champ: Controllable and consistent human image animation with 3d parametric guidance,

S. Zhu, J. L. Chen, Z. Dai, Z. Dong, Y . Xu, X. Cao, Y . Yao, H. Zhu, and S. Zhu, “Champ: Controllable and consistent human image animation with 3d parametric guidance,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 145– 162

work page 2024
[19]

Human video generation from a single image with 3d pose and view control,

T. Wang, C.-H. Yao, T. Hu, M. B. R. Reddy, M.-H. Yang, and V . Jampani, “Human video generation from a single image with 3d pose and view control,”arXiv preprint arXiv:2602.21188, 2026

work page arXiv 2026
[20]

Pia: Your per- sonalized image animator via plug-and-play modules in text-to-image models,

Y . Zhang, Z. Xing, Y . Zeng, Y . Fang, and K. Chen, “Pia: Your per- sonalized image animator via plug-and-play modules in text-to-image models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7747–7756

work page 2024
[21]

Customcrafter: Customized video generation with preserving motion and concept composition abilities,

T. Wu, Y . Zhang, X. Wang, X. Zhou, G. Zheng, Z. Qi, Y . Shan, and X. Li, “Customcrafter: Customized video generation with preserving motion and concept composition abilities,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8469– 8477

work page 2025
[22]

U-net: Convolutional net- works for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net- works for biomedical image segmentation,” inProceedings of the Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241

work page 2015
[23]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y . Levi, Z. English, V . V oleti, A. Letts, J. Varun, and R. Rombach, “Stable video diffusion: Scaling latent video diffusion models to large datasets,”arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Megactor-sigma: Unlocking flexible mixed-modal control in portrait animation with diffusion transformer,

S. Yang, H. Li, J. Wu, M. Jing, L. Li, R. Ji, J. Liang, H. Fan, and J. Wang, “Megactor-sigma: Unlocking flexible mixed-modal control in portrait animation with diffusion transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9256–9264

work page 2025
[25]

Tora: Trajectory-oriented diffusion transformer for video generation,

Z. Zhang, J. Liao, M. Li, Z. Dai, B. Qiu, S. Zhu, L. Qin, and W. Wang, “Tora: Trajectory-oriented diffusion transformer for video generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 2063–2073

work page 2025
[26]

Humandit: Pose-guided diffusion transformer for long- form human motion video generation.arXiv preprint arXiv:2502.04847, 2025

Q. Gan, Y . Ren, C. Zhang, Z. Ye, P. Xie, X. Yin, Z. Yuan, B. Peng, and J. Zhu, “Humandit: Pose-guided diffusion transformer for long-form human motion video generation,”arXiv preprint arXiv:2502.04847, 2025

work page arXiv 2025
[27]

A survey on video diffusion models,

Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, and Y .-G. Jiang, “A survey on video diffusion models,”ACM Computing Surveys, vol. 57, no. 2, pp. 1–42, 2024

work page 2024
[28]

Survey of video diffusion models: Foundations, implementations, and applications,

Y . Wang, X. Liu, W. Pang, L. Ma, S. Yuan, P. Debevec, and N. Yu, “Survey of video diffusion models: Foundations, implementations, and applications,”arXiv preprint arXiv:2504.16081, 2025

work page arXiv 2025
[29]

Con- trollable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

Y . Ma, K. Feng, Z. Hu, X. Wang, Y . Wang, M. Zheng, B. Wang, Q. Wang, X. He, H. Wang, C. Zhu, H. Liu, Y . He, Z. Wang, Z. Li, X. Li, S. Han, Y . Guo, W. Liu, D. Xu, L. Zhang, and Q. Chen, “Controllable video generation: A survey,”arXiv preprint arXiv:2507.16869, 2025

work page arXiv 2025
[30]

Diffusion model-based video editing: A survey,

W. Sun, R.-C. Tu, J. Liao, and D. Tao, “Diffusion model-based video editing: A survey,”arXiv preprint arXiv:2407.07111, 2024

work page arXiv 2024
[31]

Efficient diffusion models: A comprehen- sive survey from principles to practices,

Z. Ma, Y . Zhang, G. Jia, L. Zhao, Y . Ma, M. Ma, G. Liu, K. Zhang, N. Ding, J. Li, and B. Zhou, “Efficient diffusion models: A comprehen- sive survey from principles to practices,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[32]

Bridging text and video generation: A survey,

N. Kumar, P. Bhandari, and G. Maragatham, “Bridging text and video generation: A survey,”arXiv preprint arXiv:2510.04999, 2025

work page arXiv 2025
[33]

Human motion video generation: A survey,

H. Xue, X. Luo, Z. Hu, X. Zhang, X. Xiang, Y . Dai, J. Liu, Z. Zhang, M. Li, J. Yang, F. Ma, Z. Wu, C. Yang, Z. Dai, and F. R. Yu, “Human motion video generation: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[34]

Diffusion models: A comprehensive survey of methods and applications,

L. Yang, Z. Zhang, Y . Song, S. Hong, R. Xu, Y . Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,”ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023

work page 2023
[35]

Humanvid: demystifying training data for camera-controllable human image animation,

Z. Wang, Y . Li, Y . Zeng, Y . Fang, Y . Guo, W. Liu, J. Tan, K. Chen, T. Xue, B. Dai, and D. Lin, “Humanvid: demystifying training data for camera-controllable human image animation,” inProceedings of the Conference on Neural Information Processing Systems, 2024, pp. 20 111–20 131

work page 2024
[36]

Unianimate: Taming unified video diffusion models for consistent human image animation,

X. Wang, S. Zhang, C. Gao, J. Wang, X. Zhou, Y . Zhang, L. Yan, and N. Sang, “Unianimate: Taming unified video diffusion models for consistent human image animation,”Science China Information Sciences, vol. 68, no. 10, pp. 1–14, 2025

work page 2025
[37]

Dimensionx: Create any 3d and 4d scenes from a single image with decoupled video diffusion,

W. Sun, S. Chen, F. Liu, Z. Chen, Y . Duan, J. Zhu, J. Zhang, and Y . Wang, “Dimensionx: Create any 3d and 4d scenes from a single image with decoupled video diffusion,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13 695–13 706

work page 2025
[38]

Phyrpr: Training-free physics- constrained video generation,

Y . Zhao, H. Li, X. He, and B. Wu, “Phyrpr: Training-free physics- constrained video generation,”arXiv preprint arXiv:2601.09255, 2026

work page arXiv 2026
[39]

Prompt image to life: Training- free text-driven image-to-video generation

J. Liu, Y . Yao, B. Zhu, F. Wang, W. Luo, J. Su, Y . Zhang, Y . Wang, L. Ma, Q. Liu, J. Luo, and G.-J. Qi, “Prompt image to life: Training- free text-driven image-to-video generation.”

work page
[40]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Y . Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Li, X. Li, Y . Li, S. Lin, Z. Lin, J. Liu, S. Liu, X. Nie, Z. Qing, Y . Ren, L. Sun, Z. Tian, R. Wang, S. Wang, G. Wei, G. Wu, J. Wu, R. Xia, F. Xiao, X. Xiao, J. Yan, C. Yang, J. Yang, R. Yang, T. Yang, Y . Yang, Z. Ye, X. Zeng, Y . Zeng, H. Zhang, Y . Zhao, X. Zheng, P. Zhu,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

J. Chen, Y . Zhao, J. Yu, R. Chu, J. Chen, S. Yang, X. Wang, Y . Pan, D. Zhou, H. Ling, H. Liu, H. Yi, H. Zhang, M. Li, Y . Chen, H. Cai, S. Fidler, P. Luo, S. Han, and E. Xie, “Sana-video: Efficient video generation with block linear diffusion transformer,”arXiv preprint arXiv:2509.24695, 2025

work page arXiv 2025
[42]

Frame context packing and drift prevention in next-frame-prediction video diffusion models,

L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala, “Frame context packing and drift prevention in next-frame-prediction video diffusion models,” inProceedings of the Conference on Neural Infor- mation Processing Systems, 2025

work page 2025
[43]

Animate anyone 2: High-fidelity character image animation with environment affordance.arXiv preprint arXiv:2502.06145, 2025

L. Hu, G. Wang, Z. Shen, X. Gao, D. Meng, L. Zhuo, P. Zhang, B. Zhang, and L. Bo, “Animate anyone 2: High-fidelity charac- ter image animation with environment affordance,”arXiv preprint arXiv:2502.06145, 2025

work page arXiv 2025
[44]

Make-a-video: Text-to-video generation without text-video data,

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman, “Make-a-video: Text-to-video generation without text-video data,” in Proceedings of the International Conference on Learning Representa- tions, 2023

work page 2023
[45]

Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation,

Y . Lin, H. Fung, J. Xu, Z. Ren, A. S. Lau, G. Yin, and X. Li, “Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 26 242–26 252

work page 2025
[46]

Stiv: Scalable text and image conditioned video generation,

Z. Lin, W. Liu, C. Chen, J. Lu, W. Hu, T.-J. Fu, J. Allardice, Z. Lai, L. Song, B. Zhang, C. Chen, Y . Fei, L. Li, Y . Yang, Y . Sun, and K.-W. Chang, “Stiv: Scalable text and image conditioned video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16 249–16 259

work page 2025
[47]

Conditional image- to-video generation with latent flow diffusion models,

H. Ni, C. Shi, K. Li, S. X. Huang, and M. R. Min, “Conditional image- to-video generation with latent flow diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2023, pp. 18 444–18 455

work page 2023
[48]

Mardini: Masked autoregressive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024

H. Liu, S. Liu, Z. Zhou, M. Xu, Y . Xie, X. Han, J. C. P ´erez, D. Liu, K. Kahatapitiya, M. Jia, J.-C. Wu, S. He, T. Xiang, J. Schmidhuber, and J.-M. P ´erez-R´ua, “Mardini: Masked autoregressive diffusion for video generation at scale,”arXiv preprint arXiv:2410.20280, 2024

work page arXiv 2024
[49]

LTX-Video: Realtime Video Latent Diffusion

Y . HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordon, P. Panet, S. Weissbuch, V . Kulikov, Y . Bitterman, Z. Melumian, and O. Bibi, “Ltx-video: Realtime video latent diffusion,”arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Leanvae: An ultra-efficient reconstruction vae for video diffusion models,

Y . Cheng and F. Yuan, “Leanvae: An ultra-efficient reconstruction vae for video diffusion models,”arXiv preprint arXiv:2503.14325, 2025

work page arXiv 2025
[51]

Motion-i2v: Consistent and controllable image-to-video generation with explicit 14 motion modeling,

X. Shi, Z. Huang, F.-Y . Wang, W. Bian, D. Li, Y . Zhang, M. Zhang, K. C. Cheung, S. See, H. Qin, J. Dai, and H. Li, “Motion-i2v: Consistent and controllable image-to-video generation with explicit 14 motion modeling,” inProceedings of the ACM SIGGRAPH 2024 Conference, 2024, pp. 1–11

work page 2024
[52]

Dynamicrafter: Animating open-domain images with video diffusion priors,

J. Xing, M. Xia, Y . Zhang, H. Chen, W. Yu, H. Liu, G. Liu, X. Wang, Y . Shan, and T.-T. Wong, “Dynamicrafter: Animating open-domain images with video diffusion priors,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 399–417

work page 2024
[53]

Reducio! generating 1k video within 16 seconds using extremely compressed motion latents,

R. Tian, Q. Dai, J. Bao, K. Qiu, Y . Yang, C. Luo, Z. Wu, and Y .- G. Jiang, “Reducio! generating 1k video within 16 seconds using extremely compressed motion latents,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 237– 19 247

work page 2025
[54]

Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions,

L. Tian, Q. Wang, B. Zhang, and L. Bo, “Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 244–260

work page 2024
[56]

Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models,

C. Low and W. Wang, “Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models,”arXiv preprint arXiv:2506.03099, 2025

work page arXiv 2025
[57]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

S. Zhang, J. Wang, Y . Zhang, K. Zhao, H. Yuan, Z. Qin, X. Wang, D. Zhao, and J. Zhou, “I2vgen-xl: High-quality image-to-video synthe- sis via cascaded diffusion models,”arXiv preprint arXiv:2311.04145, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[58]

Trip: Temporal residual learning with image noise prior for image-to-video diffusion models,

Z. Zhang, F. Long, Y . Pan, Z. Qiu, T. Yao, Y . Cao, and T. Mei, “Trip: Temporal residual learning with image noise prior for image-to-video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8671–8681

work page 2024
[59]

Atomovideo: High fidelity image-to-video generation,

L. Gong, Y . Zhu, W. Li, X. Kang, B. Wang, T. Ge, and B. Zheng, “Atomovideo: High fidelity image-to-video generation,”arXiv preprint arXiv:2403.01800, 2024

work page arXiv 2024
[60]

Ti2v-zero: Zero-shot image conditioning for text-to-video diffusion models,

H. Ni, B. Egger, S. Lohit, A. Cherian, Y . Wang, T. Koike-Akino, S. X. Huang, and T. K. Marks, “Ti2v-zero: Zero-shot image conditioning for text-to-video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9015–9025

work page 2024
[61]

Consisti2v: Enhancing visual consistency for image-to-video genera- tion,

W. Ren, H. Yang, G. Zhang, C. Wei, X. Du, W. Huang, and W. Chen, “Consisti2v: Enhancing visual consistency for image-to-video genera- tion,”arXiv preprint arXiv:2402.04324, 2024

work page arXiv 2024
[62]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

H. Chen, M. Xia, Y . He, Y . Zhang, X. Cun, S. Yang, J. Xing, Y . Liu, Q. Chen, X. Wang, C. Weng, and Y . Shan, “Videocrafter1: Open diffusion models for high-quality video generation,”arXiv preprint arXiv:2310.19512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[63]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models,

H. Chen, Y . Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y . Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7310–7320

work page 2024
[64]

Open-Sora: Democratizing Efficient Video Production for All

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y . Zhou, T. Li, and Y . You, “Open-sora: Democratizing efficient video production for all,”arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Open-Sora Plan: Open-Source Large Video Generation Model

B. Lin, Y . Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y . Ye, S. Yuan, L. Chen, T. Jia, J. Zhang, Z. Tang, Y . Pang, B. She, C. Yan, Z. Hu, X. Dong, L. Chen, Z. Pan, X. Zhou, S. Dong, Y . Tian, and L. Yuan, “Open-sora plan: Open-source large video generation model,”arXiv preprint arXiv:2412.00131, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

CogVideoX: Text-to-video diffusion models with an expert transformer,

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y . Zhang, W. Wang, Y . Cheng, B. Xu, X. Gu, Y . Dong, and J. Tang, “CogVideoX: Text-to-video diffusion models with an expert transformer,” inProceedings of the International Conference on Learning Representations, 2025

work page 2025
[67]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y . Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y . Li, Y . Chen, Y . Cui, Y . Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y . ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance,

Y . Zhang, J. Gu, L.-W. Wang, H. Wang, J. Cheng, Y . Zhu, and F. Zou, “Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance,” inProceedings of the International Conference on Machine Learning. PMLR, 2025, pp. 74 896–74 910

work page 2025
[69]

Box- imator: Generating rich and controllable motions for video synthesis,

J. Wang, Y . Zhang, J. Zou, Y . Zeng, G. Wei, L. Yuan, and H. Li, “Box- imator: Generating rich and controllable motions for video synthesis,” inProceedings of the International Conference on Machine Learning. PMLR, 2024, pp. 52 274–52 289

work page 2024
[70]

Lmp: Leveraging motion prior in zero-shot video generation with diffusion transformer,

C. Chen, X. Yang, J. Shu, C. Wang, and Y . Li, “Lmp: Leveraging motion prior in zero-shot video generation with diffusion transformer,” arXiv preprint arXiv:2505.14167, 2025

work page arXiv 2025
[71]

Movideo: Motion-aware video generation with diffusion model,

J. Liang, Y . Fan, K. Zhang, R. Timofte, L. Van Gool, and R. Ranjan, “Movideo: Motion-aware video generation with diffusion model,” in Proceedings of the European Conference on Computer Vision, 2024, pp. 56–74

work page 2024
[72]

Loopy: Taming audio-driven portrait avatar with long-term motion depen- dency,

J. Jiang, C. Liang, J. Yang, G. Lin, T. Zhong, and Y . Zheng, “Loopy: Taming audio-driven portrait avatar with long-term motion depen- dency,” inProceedings of the International Conference on Learning Representations, 2025

work page 2025
[73]

Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,

M. Xu, H. Li, Q. Su, H. Shang, L. Zhang, C. Liu, J. Wang, Y . Yao, and S. Zhu, “Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,”arXiv preprint arXiv:2406.08801, 2024

work page arXiv 2024
[74]

Hallo2: Long-duration and high-resolution audio-driven portrait image animation,

J. Cui, H. Li, Y . Yao, H. Zhu, H. Shang, K. Cheng, H. Zhou, S. Zhu, and J. Wang, “Hallo2: Long-duration and high-resolution audio-driven portrait image animation,”arXiv preprint arXiv:2410.07718, 2024

work page arXiv 2024
[75]

Sonic: Shifting focus to global audio perception in portrait animation,

X. Ji, X. Hu, Z. Xu, J. Zhu, C. Lin, Q. He, J. Zhang, D. Luo, Y . Chen, Q. Lin, Q. Lu, and C. Wang, “Sonic: Shifting focus to global audio perception in portrait animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 193–203

work page 2025
[76]

Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,

Z. Chen, J. Cao, Z. Chen, Y . Li, and C. Ma, “Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 2403–2410

work page 2025
[77]

Aniportrait: Audio-driven synthesis of photorealistic portrait animation,

H. Wei, Z. Yang, and Z. Wang, “Aniportrait: Audio-driven synthesis of photorealistic portrait animation,”arXiv preprint arXiv:2403.17694, 2024

work page arXiv 2024
[78]

Stableavatar: Infinite-length audio-driven avatar video generation,

S. Tu, Y . Pan, Y . Huang, X. Han, Z. Xing, Q. Dai, C. Luo, Z. Wu, and Y .-G. Jiang, “Stableavatar: Infinite-length audio-driven avatar video generation,”arXiv preprint arXiv:2508.08248, 2025

work page arXiv 2025
[79]

Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,

Q. Gan, R. Yang, J. Zhu, S. Xue, and S. Hoi, “Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,” arXiv preprint arXiv:2506.18866, 2025

work page arXiv 2025
[80]

Cyberhost: A one-stage diffusion framework for audio-driven talking body generation,

G. Lin, J. Jiang, C. Liang, T. Zhong, J. Yang, Z. Zheng, and Y . Zheng, “Cyberhost: A one-stage diffusion framework for audio-driven talking body generation,” inProceedings of the International Conference on Learning Representations, 2025

work page 2025

Showing first 80 references.

[1] [1]

Dynamic-i2v: Exploring image-to-video generation models via multimodal llm,

P. Liu, X. Ren, F. Liu, Q. Xie, Q. Zheng, Y . Zhang, H. Lu, and Y . Yang, “Dynamic-i2v: Exploring image-to-video generation models via multimodal llm,”arXiv preprint arXiv:2505.19901, 2025

work page arXiv 2025

[2] [2]

Levitor: 3d trajectory oriented image-to-video synthesis,

H. Wang, H. Ouyang, Q. Wang, W. Wang, K. L. Cheng, Q. Chen, Y . Shen, and L. Wang, “Levitor: 3d trajectory oriented image-to-video synthesis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 12 490–12 500

work page 2025

[3] [3]

Motioncanvas: Cinematic shot design with controllable image-to-video generation,

J. Xing, L. Mai, C. Ham, J. Huang, A. Mahapatra, C.-W. Fu, T.- T. Wong, and F. Liu, “Motioncanvas: Cinematic shot design with controllable image-to-video generation,” inProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference, 2025, pp. 1–11

work page 2025

[4] [4]

Versatile transi- tion generation with image-to-video diffusion,

Z. Yang, J. Zhang, Y . Yu, S. Lu, and S. Bai, “Versatile transi- tion generation with image-to-video diffusion,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16 981–16 990

work page 2025

[5] [5]

I2v3d: Controllable image-to-video generation with 3d guidance,

Z. Zhang, D. Chen, and J. Liao, “I2v3d: Controllable image-to-video generation with 3d guidance,”arXiv preprint arXiv:2503.09733, 2025

work page arXiv 2025

[6] [6]

Realcam-i2v: Real-world image-to-video generation with interactive complex camera control,

T. Li, G. Zheng, R. Jiang, S. Zhan, T. Wu, Y . Lu, Y . Lin, C. Deng, Y . Xiong, M. Chen, L. Cheng, and X. Li, “Realcam-i2v: Real-world image-to-video generation with interactive complex camera control,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 28 785–28 796

work page 2025

[7] [7]

Extrapolating and decoupling image-to-video generation models: Motion modeling is easier than you think,

J. Tian, X. Qu, Z. Lu, W. Wei, S. Liu, and Y . Cheng, “Extrapolating and decoupling image-to-video generation models: Motion modeling is easier than you think,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 12 512–12 521

work page 2025

[8] [8]

Tagrpo: Boosting grpo on image-to-video generation with direct trajectory alignment,

J. Wang, J. Lu, G. Xu, C. Chen, H. Yang, L. Wang, P. Chen, M. Chen, Z. Hu, L. Wu, S. Shao, Q. Lu, and P. Luo, “Tagrpo: Boosting grpo on image-to-video generation with direct trajectory alignment,”arXiv preprint arXiv:2601.05729, 2026

work page arXiv 2026

[9] [9]

Auto-Encoding Variational Bayes

D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[10] [10]

I4vgen: Image as free stepping stone for text-to-video generation,

X. Guo, J. Liu, M. Cui, L. Bo, and D. Huang, “I4vgen: Image as free stepping stone for text-to-video generation,”arXiv preprint arXiv:2406.02230, 2024

work page arXiv 2024

[11] [11]

Emu video: Factorizing text-to-video generation by explicit image conditioning,

R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra, “Emu video: Factorizing text-to-video generation by explicit image conditioning,”arXiv preprint arXiv:2311.10709, 2023

work page arXiv 2023

[12] [12]

Rv-gan: Recurrent gan for uncondi- tional video generation,

S. Gupta, A. Keshari, and S. Das, “Rv-gan: Recurrent gan for uncondi- tional video generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2024–2033. 13

work page 2022

[13] [13]

Styleinv: A temporal style modulated inversion network for unconditional video generation,

Y . Wang, L. Jiang, and C. C. Loy, “Styleinv: A temporal style modulated inversion network for unconditional video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 851–22 861

work page 2023

[14] [14]

Realisdance: Equip controllable character anima- tion with realistic hands.arXiv preprint arXiv:2409.06202,

J. Zhou, B. Wang, W. Chen, J. Bai, D. Li, A. Zhang, H. Xu, M. Yang, and F. Wang, “Realisdance: Equip controllable character animation with realistic hands,”arXiv preprint arXiv:2409.06202, 2024

work page arXiv 2024

[15] [15]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation,

L. Hu, “Animate anyone: Consistent and controllable image-to-video synthesis for character animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8153–8163

work page 2024

[16] [16]

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

D. Xu, W. Nie, C. Liu, S. Liu, J. Kautz, Z. Wang, and A. Vahdat, “Camco: Camera-controllable 3d-consistent image-to-video genera- tion,”arXiv preprint arXiv:2406.02509, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,

M. Niu, X. Cun, X. Wang, Y . Zhang, Y . Shan, and Y . Zheng, “Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 111–128

work page 2024

[18] [18]

Champ: Controllable and consistent human image animation with 3d parametric guidance,

S. Zhu, J. L. Chen, Z. Dai, Z. Dong, Y . Xu, X. Cao, Y . Yao, H. Zhu, and S. Zhu, “Champ: Controllable and consistent human image animation with 3d parametric guidance,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 145– 162

work page 2024

[19] [19]

Human video generation from a single image with 3d pose and view control,

T. Wang, C.-H. Yao, T. Hu, M. B. R. Reddy, M.-H. Yang, and V . Jampani, “Human video generation from a single image with 3d pose and view control,”arXiv preprint arXiv:2602.21188, 2026

work page arXiv 2026

[20] [20]

Pia: Your per- sonalized image animator via plug-and-play modules in text-to-image models,

Y . Zhang, Z. Xing, Y . Zeng, Y . Fang, and K. Chen, “Pia: Your per- sonalized image animator via plug-and-play modules in text-to-image models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7747–7756

work page 2024

[21] [21]

Customcrafter: Customized video generation with preserving motion and concept composition abilities,

T. Wu, Y . Zhang, X. Wang, X. Zhou, G. Zheng, Z. Qi, Y . Shan, and X. Li, “Customcrafter: Customized video generation with preserving motion and concept composition abilities,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8469– 8477

work page 2025

[22] [22]

U-net: Convolutional net- works for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net- works for biomedical image segmentation,” inProceedings of the Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241

work page 2015

[23] [23]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y . Levi, Z. English, V . V oleti, A. Letts, J. Varun, and R. Rombach, “Stable video diffusion: Scaling latent video diffusion models to large datasets,”arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Megactor-sigma: Unlocking flexible mixed-modal control in portrait animation with diffusion transformer,

S. Yang, H. Li, J. Wu, M. Jing, L. Li, R. Ji, J. Liang, H. Fan, and J. Wang, “Megactor-sigma: Unlocking flexible mixed-modal control in portrait animation with diffusion transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 9, 2025, pp. 9256–9264

work page 2025

[25] [25]

Tora: Trajectory-oriented diffusion transformer for video generation,

Z. Zhang, J. Liao, M. Li, Z. Dai, B. Qiu, S. Zhu, L. Qin, and W. Wang, “Tora: Trajectory-oriented diffusion transformer for video generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 2063–2073

work page 2025

[26] [26]

Humandit: Pose-guided diffusion transformer for long- form human motion video generation.arXiv preprint arXiv:2502.04847, 2025

Q. Gan, Y . Ren, C. Zhang, Z. Ye, P. Xie, X. Yin, Z. Yuan, B. Peng, and J. Zhu, “Humandit: Pose-guided diffusion transformer for long-form human motion video generation,”arXiv preprint arXiv:2502.04847, 2025

work page arXiv 2025

[27] [27]

A survey on video diffusion models,

Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, and Y .-G. Jiang, “A survey on video diffusion models,”ACM Computing Surveys, vol. 57, no. 2, pp. 1–42, 2024

work page 2024

[28] [28]

Survey of video diffusion models: Foundations, implementations, and applications,

Y . Wang, X. Liu, W. Pang, L. Ma, S. Yuan, P. Debevec, and N. Yu, “Survey of video diffusion models: Foundations, implementations, and applications,”arXiv preprint arXiv:2504.16081, 2025

work page arXiv 2025

[29] [29]

Con- trollable video generation: A survey.arXiv preprint arXiv:2507.16869, 2025

Y . Ma, K. Feng, Z. Hu, X. Wang, Y . Wang, M. Zheng, B. Wang, Q. Wang, X. He, H. Wang, C. Zhu, H. Liu, Y . He, Z. Wang, Z. Li, X. Li, S. Han, Y . Guo, W. Liu, D. Xu, L. Zhang, and Q. Chen, “Controllable video generation: A survey,”arXiv preprint arXiv:2507.16869, 2025

work page arXiv 2025

[30] [30]

Diffusion model-based video editing: A survey,

W. Sun, R.-C. Tu, J. Liao, and D. Tao, “Diffusion model-based video editing: A survey,”arXiv preprint arXiv:2407.07111, 2024

work page arXiv 2024

[31] [31]

Efficient diffusion models: A comprehen- sive survey from principles to practices,

Z. Ma, Y . Zhang, G. Jia, L. Zhao, Y . Ma, M. Ma, G. Liu, K. Zhang, N. Ding, J. Li, and B. Zhou, “Efficient diffusion models: A comprehen- sive survey from principles to practices,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[32] [32]

Bridging text and video generation: A survey,

N. Kumar, P. Bhandari, and G. Maragatham, “Bridging text and video generation: A survey,”arXiv preprint arXiv:2510.04999, 2025

work page arXiv 2025

[33] [33]

Human motion video generation: A survey,

H. Xue, X. Luo, Z. Hu, X. Zhang, X. Xiang, Y . Dai, J. Liu, Z. Zhang, M. Li, J. Yang, F. Ma, Z. Wu, C. Yang, Z. Dai, and F. R. Yu, “Human motion video generation: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[34] [34]

Diffusion models: A comprehensive survey of methods and applications,

L. Yang, Z. Zhang, Y . Song, S. Hong, R. Xu, Y . Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,”ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023

work page 2023

[35] [35]

Humanvid: demystifying training data for camera-controllable human image animation,

Z. Wang, Y . Li, Y . Zeng, Y . Fang, Y . Guo, W. Liu, J. Tan, K. Chen, T. Xue, B. Dai, and D. Lin, “Humanvid: demystifying training data for camera-controllable human image animation,” inProceedings of the Conference on Neural Information Processing Systems, 2024, pp. 20 111–20 131

work page 2024

[36] [36]

Unianimate: Taming unified video diffusion models for consistent human image animation,

X. Wang, S. Zhang, C. Gao, J. Wang, X. Zhou, Y . Zhang, L. Yan, and N. Sang, “Unianimate: Taming unified video diffusion models for consistent human image animation,”Science China Information Sciences, vol. 68, no. 10, pp. 1–14, 2025

work page 2025

[37] [37]

Dimensionx: Create any 3d and 4d scenes from a single image with decoupled video diffusion,

W. Sun, S. Chen, F. Liu, Z. Chen, Y . Duan, J. Zhu, J. Zhang, and Y . Wang, “Dimensionx: Create any 3d and 4d scenes from a single image with decoupled video diffusion,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13 695–13 706

work page 2025

[38] [38]

Phyrpr: Training-free physics- constrained video generation,

Y . Zhao, H. Li, X. He, and B. Wu, “Phyrpr: Training-free physics- constrained video generation,”arXiv preprint arXiv:2601.09255, 2026

work page arXiv 2026

[39] [39]

Prompt image to life: Training- free text-driven image-to-video generation

J. Liu, Y . Yao, B. Zhu, F. Wang, W. Luo, J. Su, Y . Zhang, Y . Wang, L. Ma, Q. Liu, J. Luo, and G.-J. Qi, “Prompt image to life: Training- free text-driven image-to-video generation.”

work page

[40] [40]

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Y . Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Li, X. Li, Y . Li, S. Lin, Z. Lin, J. Liu, S. Liu, X. Nie, Z. Qing, Y . Ren, L. Sun, Z. Tian, R. Wang, S. Wang, G. Wei, G. Wu, J. Wu, R. Xia, F. Xiao, X. Xiao, J. Yan, C. Yang, J. Yang, R. Yang, T. Yang, Y . Yang, Z. Ye, X. Zeng, Y . Zeng, H. Zhang, Y . Zhao, X. Zheng, P. Zhu,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Sana-video: Efficient video generation with block linear diffusion transformer.arXiv preprint arXiv:2509.24695, 2025

J. Chen, Y . Zhao, J. Yu, R. Chu, J. Chen, S. Yang, X. Wang, Y . Pan, D. Zhou, H. Ling, H. Liu, H. Yi, H. Zhang, M. Li, Y . Chen, H. Cai, S. Fidler, P. Luo, S. Han, and E. Xie, “Sana-video: Efficient video generation with block linear diffusion transformer,”arXiv preprint arXiv:2509.24695, 2025

work page arXiv 2025

[42] [42]

Frame context packing and drift prevention in next-frame-prediction video diffusion models,

L. Zhang, S. Cai, M. Li, G. Wetzstein, and M. Agrawala, “Frame context packing and drift prevention in next-frame-prediction video diffusion models,” inProceedings of the Conference on Neural Infor- mation Processing Systems, 2025

work page 2025

[43] [43]

Animate anyone 2: High-fidelity character image animation with environment affordance.arXiv preprint arXiv:2502.06145, 2025

L. Hu, G. Wang, Z. Shen, X. Gao, D. Meng, L. Zhuo, P. Zhang, B. Zhang, and L. Bo, “Animate anyone 2: High-fidelity charac- ter image animation with environment affordance,”arXiv preprint arXiv:2502.06145, 2025

work page arXiv 2025

[44] [44]

Make-a-video: Text-to-video generation without text-video data,

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman, “Make-a-video: Text-to-video generation without text-video data,” in Proceedings of the International Conference on Learning Representa- tions, 2023

work page 2023

[45] [45]

Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation,

Y . Lin, H. Fung, J. Xu, Z. Ren, A. S. Lau, G. Yin, and X. Li, “Mvportrait: Text-guided motion and emotion control for multi-view vivid portrait animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 26 242–26 252

work page 2025

[46] [46]

Stiv: Scalable text and image conditioned video generation,

Z. Lin, W. Liu, C. Chen, J. Lu, W. Hu, T.-J. Fu, J. Allardice, Z. Lai, L. Song, B. Zhang, C. Chen, Y . Fei, L. Li, Y . Yang, Y . Sun, and K.-W. Chang, “Stiv: Scalable text and image conditioned video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16 249–16 259

work page 2025

[47] [47]

Conditional image- to-video generation with latent flow diffusion models,

H. Ni, C. Shi, K. Li, S. X. Huang, and M. R. Min, “Conditional image- to-video generation with latent flow diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2023, pp. 18 444–18 455

work page 2023

[48] [48]

Mardini: Masked autoregressive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024

H. Liu, S. Liu, Z. Zhou, M. Xu, Y . Xie, X. Han, J. C. P ´erez, D. Liu, K. Kahatapitiya, M. Jia, J.-C. Wu, S. He, T. Xiang, J. Schmidhuber, and J.-M. P ´erez-R´ua, “Mardini: Masked autoregressive diffusion for video generation at scale,”arXiv preprint arXiv:2410.20280, 2024

work page arXiv 2024

[49] [49]

LTX-Video: Realtime Video Latent Diffusion

Y . HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordon, P. Panet, S. Weissbuch, V . Kulikov, Y . Bitterman, Z. Melumian, and O. Bibi, “Ltx-video: Realtime video latent diffusion,”arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

Leanvae: An ultra-efficient reconstruction vae for video diffusion models,

Y . Cheng and F. Yuan, “Leanvae: An ultra-efficient reconstruction vae for video diffusion models,”arXiv preprint arXiv:2503.14325, 2025

work page arXiv 2025

[51] [51]

Motion-i2v: Consistent and controllable image-to-video generation with explicit 14 motion modeling,

X. Shi, Z. Huang, F.-Y . Wang, W. Bian, D. Li, Y . Zhang, M. Zhang, K. C. Cheung, S. See, H. Qin, J. Dai, and H. Li, “Motion-i2v: Consistent and controllable image-to-video generation with explicit 14 motion modeling,” inProceedings of the ACM SIGGRAPH 2024 Conference, 2024, pp. 1–11

work page 2024

[52] [52]

Dynamicrafter: Animating open-domain images with video diffusion priors,

J. Xing, M. Xia, Y . Zhang, H. Chen, W. Yu, H. Liu, G. Liu, X. Wang, Y . Shan, and T.-T. Wong, “Dynamicrafter: Animating open-domain images with video diffusion priors,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 399–417

work page 2024

[53] [53]

Reducio! generating 1k video within 16 seconds using extremely compressed motion latents,

R. Tian, Q. Dai, J. Bao, K. Qiu, Y . Yang, C. Luo, Z. Wu, and Y .- G. Jiang, “Reducio! generating 1k video within 16 seconds using extremely compressed motion latents,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 237– 19 247

work page 2025

[54] [54]

Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions,

L. Tian, Q. Wang, B. Zhang, and L. Bo, “Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions,” inProceedings of the European Conference on Computer Vision. Springer, 2024, pp. 244–260

work page 2024

[56] [56]

Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models,

C. Low and W. Wang, “Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models,”arXiv preprint arXiv:2506.03099, 2025

work page arXiv 2025

[57] [57]

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

S. Zhang, J. Wang, Y . Zhang, K. Zhao, H. Yuan, Z. Qin, X. Wang, D. Zhao, and J. Zhou, “I2vgen-xl: High-quality image-to-video synthe- sis via cascaded diffusion models,”arXiv preprint arXiv:2311.04145, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[58] [58]

Trip: Temporal residual learning with image noise prior for image-to-video diffusion models,

Z. Zhang, F. Long, Y . Pan, Z. Qiu, T. Yao, Y . Cao, and T. Mei, “Trip: Temporal residual learning with image noise prior for image-to-video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8671–8681

work page 2024

[59] [59]

Atomovideo: High fidelity image-to-video generation,

L. Gong, Y . Zhu, W. Li, X. Kang, B. Wang, T. Ge, and B. Zheng, “Atomovideo: High fidelity image-to-video generation,”arXiv preprint arXiv:2403.01800, 2024

work page arXiv 2024

[60] [60]

Ti2v-zero: Zero-shot image conditioning for text-to-video diffusion models,

H. Ni, B. Egger, S. Lohit, A. Cherian, Y . Wang, T. Koike-Akino, S. X. Huang, and T. K. Marks, “Ti2v-zero: Zero-shot image conditioning for text-to-video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9015–9025

work page 2024

[61] [61]

Consisti2v: Enhancing visual consistency for image-to-video genera- tion,

W. Ren, H. Yang, G. Zhang, C. Wei, X. Du, W. Huang, and W. Chen, “Consisti2v: Enhancing visual consistency for image-to-video genera- tion,”arXiv preprint arXiv:2402.04324, 2024

work page arXiv 2024

[62] [62]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

H. Chen, M. Xia, Y . He, Y . Zhang, X. Cun, S. Yang, J. Xing, Y . Liu, Q. Chen, X. Wang, C. Weng, and Y . Shan, “Videocrafter1: Open diffusion models for high-quality video generation,”arXiv preprint arXiv:2310.19512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[63] [63]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models,

H. Chen, Y . Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y . Shan, “Videocrafter2: Overcoming data limitations for high-quality video diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7310–7320

work page 2024

[64] [64]

Open-Sora: Democratizing Efficient Video Production for All

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y . Zhou, T. Li, and Y . You, “Open-sora: Democratizing efficient video production for all,”arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Open-Sora Plan: Open-Source Large Video Generation Model

B. Lin, Y . Ge, X. Cheng, Z. Li, B. Zhu, S. Wang, X. He, Y . Ye, S. Yuan, L. Chen, T. Jia, J. Zhang, Z. Tang, Y . Pang, B. She, C. Yan, Z. Hu, X. Dong, L. Chen, Z. Pan, X. Zhou, S. Dong, Y . Tian, and L. Yuan, “Open-sora plan: Open-source large video generation model,”arXiv preprint arXiv:2412.00131, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[66] [66]

CogVideoX: Text-to-video diffusion models with an expert transformer,

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y . Zhang, W. Wang, Y . Cheng, B. Xu, X. Gu, Y . Dong, and J. Tang, “CogVideoX: Text-to-video diffusion models with an expert transformer,” inProceedings of the International Conference on Learning Representations, 2025

work page 2025

[67] [67]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y . Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y . Li, Y . Chen, Y . Cui, Y . Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y . ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [68]

Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance,

Y . Zhang, J. Gu, L.-W. Wang, H. Wang, J. Cheng, Y . Zhu, and F. Zou, “Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance,” inProceedings of the International Conference on Machine Learning. PMLR, 2025, pp. 74 896–74 910

work page 2025

[69] [69]

Box- imator: Generating rich and controllable motions for video synthesis,

J. Wang, Y . Zhang, J. Zou, Y . Zeng, G. Wei, L. Yuan, and H. Li, “Box- imator: Generating rich and controllable motions for video synthesis,” inProceedings of the International Conference on Machine Learning. PMLR, 2024, pp. 52 274–52 289

work page 2024

[70] [70]

Lmp: Leveraging motion prior in zero-shot video generation with diffusion transformer,

C. Chen, X. Yang, J. Shu, C. Wang, and Y . Li, “Lmp: Leveraging motion prior in zero-shot video generation with diffusion transformer,” arXiv preprint arXiv:2505.14167, 2025

work page arXiv 2025

[71] [71]

Movideo: Motion-aware video generation with diffusion model,

J. Liang, Y . Fan, K. Zhang, R. Timofte, L. Van Gool, and R. Ranjan, “Movideo: Motion-aware video generation with diffusion model,” in Proceedings of the European Conference on Computer Vision, 2024, pp. 56–74

work page 2024

[72] [72]

Loopy: Taming audio-driven portrait avatar with long-term motion depen- dency,

J. Jiang, C. Liang, J. Yang, G. Lin, T. Zhong, and Y . Zheng, “Loopy: Taming audio-driven portrait avatar with long-term motion depen- dency,” inProceedings of the International Conference on Learning Representations, 2025

work page 2025

[73] [73]

Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,

M. Xu, H. Li, Q. Su, H. Shang, L. Zhang, C. Liu, J. Wang, Y . Yao, and S. Zhu, “Hallo: Hierarchical audio-driven visual synthesis for portrait image animation,”arXiv preprint arXiv:2406.08801, 2024

work page arXiv 2024

[74] [74]

Hallo2: Long-duration and high-resolution audio-driven portrait image animation,

J. Cui, H. Li, Y . Yao, H. Zhu, H. Shang, K. Cheng, H. Zhou, S. Zhu, and J. Wang, “Hallo2: Long-duration and high-resolution audio-driven portrait image animation,”arXiv preprint arXiv:2410.07718, 2024

work page arXiv 2024

[75] [75]

Sonic: Shifting focus to global audio perception in portrait animation,

X. Ji, X. Hu, Z. Xu, J. Zhu, C. Lin, Q. He, J. Zhang, D. Luo, Y . Chen, Q. Lin, Q. Lu, and C. Wang, “Sonic: Shifting focus to global audio perception in portrait animation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 193–203

work page 2025

[76] [76]

Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,

Z. Chen, J. Cao, Z. Chen, Y . Li, and C. Ma, “Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 2403–2410

work page 2025

[77] [77]

Aniportrait: Audio-driven synthesis of photorealistic portrait animation,

H. Wei, Z. Yang, and Z. Wang, “Aniportrait: Audio-driven synthesis of photorealistic portrait animation,”arXiv preprint arXiv:2403.17694, 2024

work page arXiv 2024

[78] [78]

Stableavatar: Infinite-length audio-driven avatar video generation,

S. Tu, Y . Pan, Y . Huang, X. Han, Z. Xing, Q. Dai, C. Luo, Z. Wu, and Y .-G. Jiang, “Stableavatar: Infinite-length audio-driven avatar video generation,”arXiv preprint arXiv:2508.08248, 2025

work page arXiv 2025

[79] [79]

Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,

Q. Gan, R. Yang, J. Zhu, S. Xue, and S. Hoi, “Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation,” arXiv preprint arXiv:2506.18866, 2025

work page arXiv 2025

[80] [80]

Cyberhost: A one-stage diffusion framework for audio-driven talking body generation,

G. Lin, J. Jiang, C. Liang, T. Zhong, J. Yang, Z. Zheng, and Y . Zheng, “Cyberhost: A one-stage diffusion framework for audio-driven talking body generation,” inProceedings of the International Conference on Learning Representations, 2025

work page 2025