pith. machine review for the scientific record.

arxiv: 2604.25819 · v1 · submitted 2026-04-28 · 💻 cs.CV · cs.SD

Recognition: unknown

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Biao Jiang, Daquan Zhou, Jiabao Wang, Lianghua Huang, Ming-Ming Cheng, Qibin Hou, Yu Liu, Yupeng Shi, Yupeng Zhou, Zhifan Wu

Pith reviewed 2026-05-07 16:49 UTC · model grok-4.3

classification 💻 cs.CV cs.SD
keywords mutual forcing · autoregressive generation · audio-video synchronization · self-distillation · fast sampling · character animation · multimodal modeling · streaming generation

The pith

Mutual Forcing enables a single autoregressive model to produce synchronized audio and video for animated characters using only 4 to 8 steps by making its few-step and multi-step modes improve each other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Mutual Forcing to address two challenges at once: joint audio-video modeling and fast autoregressive generation for character animation. It first trains separate uni-modal generators, then couples them for joint optimization on paired data. The central mechanism runs both few-step and multi-step generation inside one weight-shared model, so the multi-step mode distills knowledge into the few-step mode while the few-step mode supplies realistic historical context that reduces training-inference mismatch. Because the modes share parameters, these improvements reinforce each other directly on real data and eliminate the need for a separate bidirectional teacher model. Experiments show the resulting system matches or exceeds the quality of baselines that require around 50 sampling steps while using only 4 to 8 steps.

Core claim

Mutual Forcing integrates few-step and multi-step generation within a single weight-shared autoregressive audio-video model. The multi-step mode performs self-distillation to improve the few-step mode, while the few-step mode generates historical context to enhance training-inference consistency; shared parameters allow these two effects to reinforce each other. This produces native fast causal generation that matches or surpasses strong baselines requiring around 50 sampling steps when using only 4 to 8 steps, all without an additional bidirectional teacher.
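
To make the dual-mode mechanic concrete, below is a minimal, self-contained PyTorch sketch of the training loop as described: one weight-shared denoiser is sampled in few-step mode to roll out history, sampled in multi-step mode to produce a distillation target, and then updated against both that target and real data. Everything here is an illustrative assumption rather than the authors' code: the class and function names, the toy dimensions, and the plain MSE distillation objective all stand in for the paper's latent audio-video chunks and its actual losses and schedules.

```python
import torch
import torch.nn as nn

FEW_STEPS, MANY_STEPS = 4, 16     # fast inference mode vs. slower "teacher" mode (toy values)
CHUNK_DIM, CTX_LEN = 32, 3        # toy latent size and number of history chunks

class DualModeDenoiser(nn.Module):
    """A single weight-shared network used for both few-step and multi-step sampling."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK_DIM * (CTX_LEN + 1) + 1, 128), nn.SiLU(),
            nn.Linear(128, CHUNK_DIM),
        )

    def forward(self, noisy_chunk, context, t):
        # noisy_chunk: (B, CHUNK_DIM); context: (B, CTX_LEN * CHUNK_DIM); t: (1,) step in [0, 1]
        t_col = t.expand(noisy_chunk.size(0), 1)
        return self.net(torch.cat([noisy_chunk, context, t_col], dim=-1))  # predicted clean chunk

    @torch.no_grad()
    def sample(self, context, steps):
        """Iterative refinement with `steps` passes; same weights whether steps is 4 or 16."""
        x = torch.randn(context.size(0), CHUNK_DIM)
        for i in reversed(range(steps)):
            t = torch.tensor([(i + 1) / steps])
            x0 = self.forward(x, context, t)
            x = x0 + (i / steps) * torch.randn_like(x)  # crude re-noising schedule for the sketch
        return x

model = DualModeDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def mutual_forcing_step(real_chunks):
    """real_chunks: (B, CTX_LEN + 1, CHUNK_DIM) toy stand-in for paired audio-video latents."""
    B = real_chunks.size(0)

    # 1) Few-step mode rolls out the *history* autoregressively, so training conditions on
    #    the same imperfect prefixes the model will produce at inference time.
    chunks = []
    for _ in range(CTX_LEN):
        pad = [torch.zeros(B, CHUNK_DIM)] * (CTX_LEN - len(chunks))
        ctx = torch.cat(chunks + pad, dim=-1)
        chunks.append(model.sample(ctx, FEW_STEPS))
    context = torch.cat(chunks, dim=-1)

    # 2) Multi-step mode (same weights, more refinement passes) produces the distillation target.
    target = model.sample(context, MANY_STEPS)

    # 3) One gradient step for the few-step mode: match the multi-step target (self-distillation)
    #    and denoise the real next chunk, so real paired data keeps both modes improving.
    pred_fast = model(torch.randn(B, CHUNK_DIM), context, torch.tensor([1.0]))
    t = torch.rand(1)
    clean = real_chunks[:, -1]
    noisy = t * torch.randn_like(clean) + (1 - t) * clean
    pred_real = model(noisy, context, t)

    loss = nn.functional.mse_loss(pred_fast, target) + nn.functional.mse_loss(pred_real, clean)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage: one optimisation step on a random batch standing in for paired audio-video data.
print(mutual_forcing_step(torch.randn(8, CTX_LEN + 1, CHUNK_DIM)))
```

The structural point the sketch tries to capture is that the same parameters serve both the 4-step and the 16-step sampler, so any update that helps the few-step student also upgrades the multi-step teacher, and vice versa.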

What carries the argument

Mutual Forcing, the dual-mode self-evolution process in a weight-shared autoregressive model where few-step and multi-step modes mutually reinforce via self-distillation and consistency gains on real paired data.

If this is right

  • Matches or surpasses quality of 50-step baselines using only 4-8 steps.
  • Eliminates the requirement for a separate bidirectional teacher model.
  • Supports flexible training sequence lengths and direct improvement from real paired data.
  • Reduces training overhead relative to multi-stage distillation pipelines.
  • Enables streaming generation with native causal long-horizon audio-video synchronization (see the sketch below).
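
The streaming bullet is easiest to picture as a loop: each new chunk is refined in only a few passes, conditioned on the chunks already emitted, and handed off before the next one starts. The sketch below is a hypothetical illustration of that loop, not the authors' inference code: a random linear layer stands in for the model, and the re-noising schedule of a real diffusion-style sampler is omitted.

```python
import torch

FEW_STEPS, CHUNK_DIM, CTX_LEN, N_CHUNKS = 4, 32, 3, 10
denoiser = torch.nn.Linear(CHUNK_DIM * (CTX_LEN + 1) + 1, CHUNK_DIM)  # stand-in for the real model

def stream_chunks():
    history = []                                    # causal context: only past chunks are visible
    for _ in range(N_CHUNKS):
        past = history[-CTX_LEN:]
        past = past + [torch.zeros(CHUNK_DIM)] * (CTX_LEN - len(past))
        ctx = torch.cat(past)
        x = torch.randn(CHUNK_DIM)
        with torch.no_grad():
            for i in reversed(range(FEW_STEPS)):    # only 4 refinement passes per chunk -> low latency
                t = torch.tensor([(i + 1) / FEW_STEPS])
                x = denoiser(torch.cat([x, ctx, t]))
        history.append(x)
        yield x                                     # each chunk is emitted immediately (streaming)

for chunk in stream_chunks():
    pass  # in the real system each chunk would decode to synchronized audio and video frames
```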

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-reinforcement loop may allow stable scaling to longer sequences or higher resolutions without external supervision.
  • The same dual-mode structure could be applied to other autoregressive multimodal tasks such as text-to-video or speech synthesis.
  • Removing the teacher model simplifies the overall pipeline and may lower compute costs for future training runs.
  • Dynamically weighting the two modes during training could yield further quality gains beyond the fixed dual-mode setup.

Load-bearing premise

The two modes inside the shared model will mutually reinforce each other through self-distillation and improved training-inference consistency on real paired data without an external teacher.

What would settle it

A controlled ablation that disables one of the two modes during training and measures whether few-step generation quality drops below the multi-step baseline on the same paired audio-video dataset.

read the original abstract

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Mutual Forcing, a framework for fast autoregressive audio-video character generation. It uses a two-stage strategy of uni-modal pretraining followed by joint training on paired data. The core idea integrates few-step and multi-step generation modes inside a single weight-shared autoregressive model to perform self-distillation and enforce training-inference consistency, removing the need for a separate bidirectional teacher. The authors claim this yields generation quality matching or exceeding strong baselines that use ~50 sampling steps, while requiring only 4-8 steps.

Significance. If the central efficiency and quality claims are substantiated with rigorous metrics, the work would offer a meaningful simplification over existing distillation pipelines such as Self-Forcing by eliminating the bidirectional teacher and allowing direct improvement from real paired data. The dual-mode self-evolution within one model could reduce training overhead and support more flexible sequence lengths, which would be valuable for streaming audio-video applications. The absence of quantitative results, ablations, and dataset details in the current text, however, prevents a full assessment of impact.

major comments (3)
  1. [Abstract] Abstract: the central claim that Mutual Forcing 'matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps' is presented without any quantitative metrics (e.g., FID, synchronization error, PSNR, or user-study scores), ablation tables, error bars, or dataset descriptions. This leaves the efficiency and quality advantages, which are load-bearing for the paper's contribution, unverifiable.
  2. [§3] §3 (Mutual Forcing training loop): the mutual reinforcement argument assumes that few-step mode outputs quickly become sufficiently accurate to serve as historical context for multi-step training on real paired data. No analysis or early-training diagnostics are supplied to address the risk that degraded few-step prefixes introduce accumulating audio-video sync errors or token inconsistencies, which could prevent the claimed self-distillation from occurring.
  3. [§4] §4 (Experiments): the manuscript states advantages over baselines but provides no comparison tables, ablation studies on the two-stage pretraining, or controls isolating the contribution of weight sharing versus the dual-mode loop. Without these, it is impossible to confirm that the observed gains stem from Mutual Forcing rather than other factors.
minor comments (2)
  1. [§3.1] The notation for the two modes (few-step vs. multi-step) and the self-distillation loss should be defined more explicitly with equations to improve readability.
  2. [§4] The project page is referenced but the manuscript would benefit from a reproducibility statement detailing hyperparameters, training sequence lengths, and hardware used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have reviewed each major comment carefully and provide point-by-point responses below. Where appropriate, we indicate revisions that will be incorporated into the next version of the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Mutual Forcing 'matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps' is presented without any quantitative metrics (e.g., FID, synchronization error, PSNR, or user-study scores), ablation tables, error bars, or dataset descriptions. This leaves the efficiency and quality advantages, which are load-bearing for the paper's contribution, unverifiable.

    Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claims immediately verifiable. In the revised manuscript, we will update the abstract to report specific metrics from our experiments, including FID, synchronization error, PSNR, user-study scores, dataset descriptions, and error bars. This will directly substantiate the efficiency and quality advantages without requiring the reader to consult later sections. revision: yes

  2. Referee: [§3] §3 (Mutual Forcing training loop): the mutual reinforcement argument assumes that few-step mode outputs quickly become sufficiently accurate to serve as historical context for multi-step training on real paired data. No analysis or early-training diagnostics are supplied to address the risk that degraded few-step prefixes introduce accumulating audio-video sync errors or token inconsistencies, which could prevent the claimed self-distillation from occurring.

    Authors: The potential for early-training error accumulation is a reasonable concern. While our final results demonstrate that self-distillation succeeds, we did not provide early-training diagnostics in the submitted version. We will add an analysis (in the main text or appendix) showing synchronization error and token consistency metrics tracked over training epochs to confirm that few-step prefixes stabilize sufficiently quickly for the mutual reinforcement to take effect. revision: yes

  3. Referee: [§4] §4 (Experiments): the manuscript states advantages over baselines but provides no comparison tables, ablation studies on the two-stage pretraining, or controls isolating the contribution of weight sharing versus the dual-mode loop. Without these, it is impossible to confirm that the observed gains stem from Mutual Forcing rather than other factors.

    Authors: We acknowledge that the experiments section requires more explicit supporting material to isolate the contributions of Mutual Forcing. The submitted manuscript contains comparative results, but to fully address this point we will expand §4 with detailed comparison tables against the 50-step baselines, dedicated ablations on the two-stage pretraining, and controls that separate the effects of weight sharing from the dual-mode self-distillation loop. These additions will clarify that the reported gains arise from the proposed framework. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical training framework (Mutual Forcing) that couples few-step and multi-step modes inside one weight-shared autoregressive model, using real paired data for self-distillation and consistency. No mathematical derivation chain is presented whose outputs reduce by construction to its inputs; performance claims rest on experimental comparisons to baselines rather than tautological definitions or fitted quantities renamed as predictions. The two-stage uni-modal pretraining followed by joint training is a procedural choice, not a self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are invoked to force the central result. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated premise that autoregressive models can be stably trained with mixed few-step and multi-step rollouts under shared parameters; no explicit free parameters, axioms, or invented entities are enumerated in the abstract.

pith-pipeline@v0.9.0 · 5631 in / 1220 out tokens · 59956 ms · 2026-05-07T16:49:22.256385+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 36 canonical work pages · 14 internal anchors

  1. [1]

    Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset. arXiv preprint arXiv:2506.22554, 2025

    Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, et al. Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset. arXiv preprint arXiv:2506.22554, 2025

  2. [2]

    Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arXiv preprint arXiv:2407.04051, 2024

    Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, et al. Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms. arXiv preprint arXiv:2407.04051, 2024

  3. [3]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  4. [4]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

  5. [5]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  6. [6]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024

  7. [7]

    Out of time: automated lip sync in the wild

    Joon Son Chung and Andrew Zisserman. Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision, pages 251–263. Springer, 2016

  8. [8]

    Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

  9. [9]

    Stable audio open

    Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Stable audio open. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  10. [10]

    Phased dmd: Few-step distribution matching distillation via score matching within subintervals

    Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, and Lei Yang. Phased dmd: Few-step distribution matching distillation via score matching within subintervals. arXiv preprint arXiv:2510.27684, 2025

  11. [11]

    One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024

  12. [12]

    OmniAvatar: Efficient audio-driven avatar video generation with adaptive body animation. arXiv preprint arXiv:2506.18866, 2025

    Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, and Steven Hoi. Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation. arXiv preprint arXiv:2506.18866, 2025

  13. [13]

    Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621, 2025

    Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, et al. Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621, 2025

  14. [14]

    Long video generation with time-agnostic vqgan and time-sensitive transformer

    Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pages 102–118. Springer, 2022

  15. [15]

    Sparsectrl: Adding sparse controls to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. In European Conference on Computer Vision, pages 330–348. Springer, 2024

  16. [16]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  17. [17]

    Av-link: Temporally-aligned diffusion features for cross-modal audio-video generation

    Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, and Sergey Tulyakov. Av-link: Temporally-aligned diffusion features for cross-modal audio-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19373–19385, 2025

  18. [18]

    Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

    Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 885–890. IEEE, 2024

  19. [19]

    Latent video diffusion models for high-fidelity long video generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022

  20. [20]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022

  21. [21]

    Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022

  22. [22]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

  23. [23]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  24. [24]

    Jova: Unified multimodal learning for joint video-audio generation. arXiv preprint arXiv:2512.13677, 2025

    Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, and Kai Han. Jova: Unified multimodal learning for joint video-audio generation. arXiv preprint arXiv:2512.13677, 2025

  25. [25]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  26. [26]

    A simple but strong baseline for sounding video generation: Effective adaptation of audio and video diffusion models for joint generation

    Masato Ishii, Akio Hayakawa, Takashi Shibuya, and Yuki Mitsufuji. A simple but strong baseline for sounding video generation: Effective adaptation of audio and video diffusion models for joint generation. In 2025 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2025

  27. [27]

    Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024

  28. [28]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  29. [29]

    Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377, 2025

    Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, et al. Javisdit: Joint audio-video diffusion transformer with hierarchical spatio-temporal prior synchronization. arXiv preprint arXiv:2503.23377, 2025

  30. [30]

    Ovi: Twin backbone cross-modal fusion for audio-video generation

    Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284, 2025

  31. [31]

    Follow your pose: Pose-guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4117–4125, 2024

  32. [32]

    Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model

    Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. In European Conference on Computer Vision, pages 111–128. Springer, 2024

  33. [33]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024

  34. [34]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022

  35. [35]

    Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024

    Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024

  36. [36]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  37. [37]

    Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation

    Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, and Baining Guo. Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10219–10228, 2023

  38. [38]

    Improved Techniques for Training Consistency Models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023

  39. [39]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

  40. [40]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  41. [41]

    Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound

    Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, and Wei-Ning Hsu. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. 2025

  42. [42]

    Generating the future with adversarial transformers

    Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1020–1028, 2017

  43. [43]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  44. [44]

    Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155, 2025

    Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155, 2025

  45. [45]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023

  46. [46]

    Av-dit: Taming image diffusion transformers for efficient joint audio and video generation

    Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, and Yapeng Tian. Av-dit: Taming image diffusion transformers for efficient joint audio and video generation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10486–10495, 2025

  47. [47]

    Fantasytalking: Realistic talking portrait generation via coherent motion synthesis

    Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, and Mu Xu. Fantasytalking: Realistic talking portrait generation via coherent motion synthesis. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9891–9900, 2025

  48. [48]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

  49. [49]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023

  50. [50]

    Loong: Generating minute-level long videos with autoregressive language models

    Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024

  51. [51]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yuanhui Wu, Yuan Huang, Yuan Xu, Yutong Li, Weiran Lin, Chen Zhang, Di Wang, Qingjian Kong, and Wei Xu. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

  52. [52]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1481–1490, 2024

  53. [53]

    Stand-in: A lightweight and plug-and-play identity control for video generation. arXiv preprint arXiv:2508.07901, 2025

    Bowen Xue, Zheng-Peng Duan, Qixin Yan, Wenjing Wang, Hao Liu, Chun-Le Guo, Chongyi Li, Chen Li, and Jing Lyu. Stand-in: A lightweight and plug-and-play identity control for video generation. arXiv preprint arXiv:2508.07901, 2025

  54. [54]

    Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

  55. [55]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

  56. [56]

    Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

  57. [57]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

  58. [58]

    Uniavgen: Unified audio and video generation with asymmetric cross-modal interactions

    Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, and Limin Wang. Uniavgen: Unified audio and video generation with asymmetric cross-modal interactions. arXiv preprint arXiv:2511.03334, 2025

  59. [59]

    Speakervid-5m: A large-scale high-quality dataset for audio-visual dyadic interactive human generation. arXiv preprint arXiv:2507.09862, 2025

    Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, and Xiu Li. Speakervid-5m: A large-scale high-quality dataset for audio-visual dyadic interactive human generation. arXiv preprint arXiv:2507.09862, 2025

  60. [60]

    Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

    Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Large scale diffusion distillation via score-regularized continuous-time consistency. arXiv preprint arXiv:2510.08431, 2025

  61. [61]

    Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458, 2024

    Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458, 2024