pith. machine review for the scientific record.

arXiv: 2410.13720 · v2 · submitted 2024-10-17 · cs.CV · cs.AI · cs.LG · eess.IV

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Albert Pumarola, Ali Thabet, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Bowen Shi, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dimitry Vengertsev, Dingkang Wang, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Jeff Liang, Jialiang Wang, Ji Hou, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Kiran Jagadeesh, Kunpeng Li, Lawrence Chen, Licheng Yu, Luxin Zhang, Luya Gao, Mannat Singh, Markos Georgopoulos, Mary Williamson, Matthew Yu, Matt Le, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rashel Moritz, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Samaneh Azadi, Sam Tsai, Samyak Datta, Sanyuan Chen, Sara K. Sampson, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Shikai Li, Siddharth Bhattacharya, Simone Parmeggiani, Simran Motwani, Steve Fine, Tao Xu, Tara Fowler, Tianhe Li, Tingbo Hou, Vladan Petrovic, Wei-Ning Hsu, Xiaoliang Dai, Xi Yin, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuming Du, Yuval Kirstain, Zecheng He, Zijian He

Pith reviewed 2026-05-11 14:10 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG · eess.IV
keywords text-to-video · video generation · foundation models · multimodal generation · video editing · audio synthesis · personalized video

The pith

Movie Gen introduces foundation models that generate 1080p videos with synchronized audio; the authors claim state-of-the-art results on text-to-video, personalization, editing, and audio tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Movie Gen as a family of large-scale models for producing high-definition videos complete with matching sound. These models handle text prompts to create video, personalize clips using a reference image of a user, edit existing video according to instructions, and generate audio either from video input or from text alone. The biggest model has 30 billion parameters and uses a context window of 73,000 video tokens to output 16-second clips at 16 frames per second. The work describes simplifications and advances in architecture, data handling, and training that support scaling up to these sizes. A reader might care because the results suggest that large transformers can be adapted to create complex, synchronized media directly from simple inputs.

Core claim

The authors show that a set of transformer-based foundation models, led by a 30B-parameter model trained with a maximum context length of 73K video tokens, can produce 1080p videos with varying aspect ratios and synchronized audio. The models also support image-based personalization, instruction-driven editing, and both video-to-audio and text-to-audio synthesis. The authors attribute the reported state-of-the-art performance to targeted changes in architecture, latent representations, training objectives, data curation, and inference methods.

What carries the argument

The 30B-parameter transformer with 73K video token context length that generates 16-second 1080p videos at 16 frames per second while incorporating the listed technical simplifications.
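
As a rough sanity check on these figures, the arithmetic below relates clip length, frame rate, and context length. It uses only the numbers quoted in the abstract. Reading "73K" as exactly 73,000 is an assumption, and the per-frame figure is an average, not a statement about how the model tokenizes individual frames.

```python
# Figures quoted in the abstract.
DURATION_S = 16          # clip length in seconds
FPS = 16                 # frames per second
CONTEXT_TOKENS = 73_000  # assumption: "73K" taken as exactly 73,000

# Number of output frames in one maximum-length clip.
frames = DURATION_S * FPS

# Average token budget per output frame and per second of video.
tokens_per_frame = CONTEXT_TOKENS / frames
tokens_per_second = CONTEXT_TOKENS / DURATION_S

print(frames)                       # 256
print(round(tokens_per_frame, 1))   # 285.2
print(round(tokens_per_second, 1))  # 4562.5
```

The abstract does not say at what resolution these tokens are generated or how the latent space compresses frames, so no per-pixel compression factor can be derived from it.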

If this is right

  • Text-to-video synthesis reaches a new performance level.
  • Instruction-based video editing can be performed precisely at high quality.
  • Personalized videos can be created from a single user-provided image.
  • Video-to-audio and text-to-audio generation also attain leading results.
  • Scaling model size, data volume, and compute produces measurable gains in media generation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scaling improvements hold, similar methods could shorten the time needed to produce custom video content for education or entertainment.
  • The approach implies that language-model-style scaling laws may extend to joint video and audio generation.
  • Releasing model weights or detailed benchmark code would allow other groups to test extensions such as longer video durations.

Load-bearing premise

The state-of-the-art performance claims depend on internal evaluations and baseline comparisons whose exact protocols, datasets, and human preference methods are not specified in detail.

What would settle it

Independent evaluation of the models on public video generation benchmarks with the same human preference tests used internally would directly confirm or refute the claimed performance gains.

Original abstract

We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Movie Gen, a collection of foundation models for generating high-quality 1080p HD videos with synchronized audio from text, including capabilities for precise instruction-based video editing and personalized video generation from user images. The largest model is a 30B-parameter transformer trained with up to 73K video tokens (corresponding to 16-second videos at 16 fps). The work claims new state-of-the-art results on text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation, while describing technical innovations and simplifications in architecture, latent spaces, training objectives, data curation, evaluation protocols, parallelization, and inference optimizations to support scaling.

Significance. If the reported performance gains hold under transparent, reproducible conditions with appropriate baselines and statistical rigor, the work would constitute a substantial contribution to scalable media generation by demonstrating practical training of very large video transformers and by releasing example outputs to the community.

major comments (1)
  1. [Abstract] The central claims of new state-of-the-art performance on text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation rest on internal evaluations whose test prompts, baseline selection criteria (including whether recent public models were included), rater count and instructions, preference aggregation method, and significance testing are not specified. This renders the empirical superiority assertions unverifiable from the provided text.
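
To make the missing details concrete, here is a minimal sketch of one way pairwise human preferences could be aggregated and tested. It is not the paper's protocol: the vote data, the function names, the majority-vote rule, the net win rate metric, and the exact sign test are all illustrative assumptions.

```python
from collections import Counter
from math import comb


def majority_label(votes):
    """Collapse one prompt's rater votes into 'A', 'B', or 'tie'."""
    counts = Counter(votes)
    a, b = counts.get("A", 0), counts.get("B", 0)
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"


def net_win_rate(labels):
    """(wins - losses) / all prompts, with model A as the candidate."""
    return (labels.count("A") - labels.count("B")) / len(labels)


def sign_test_p_value(labels):
    """Two-sided exact sign test on per-prompt outcomes; ties are dropped."""
    wins, losses = labels.count("A"), labels.count("B")
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical data: 8 prompts, 3 raters each, comparing model A to baseline B.
votes_per_prompt = [
    ["A", "A", "B"],
    ["A", "A", "A"],
    ["B", "B", "A"],
    ["A", "tie", "A"],
    ["A", "B", "tie"],
    ["A", "A", "B"],
    ["A", "B", "A"],
    ["A", "A", "A"],
]

labels = [majority_label(v) for v in votes_per_prompt]
print(labels)                     # ['A', 'A', 'B', 'A', 'tie', 'A', 'A', 'A']
print(net_win_rate(labels))       # 0.625
print(sign_test_p_value(labels))  # 0.125
```

With only eight prompts, a 6-to-1 split in favor of model A still yields a p-value of 0.125, which shows why the number of prompts and raters matters when judging a claimed preference margin.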

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their detailed review and for emphasizing the need for greater transparency in our evaluation methodology. We address the major comment below and commit to revisions that improve verifiability while respecting the practical constraints of large-scale internal evaluations.

Point-by-point responses
  1. Referee: [Abstract] The central claims of new state-of-the-art performance on text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation rest on internal evaluations whose test prompts, baseline selection criteria (including whether recent public models were included), rater count and instructions, preference aggregation method, and significance testing are not specified. This renders the empirical superiority assertions unverifiable from the provided text.

    Authors: We appreciate the referee's observation that the abstract's SOTA claims would be strengthened by more explicit details on our evaluation protocols. While the full manuscript contains a dedicated evaluation section describing our human preference studies, we agree that additional specificity is warranted. In the revised manuscript we will: (1) explicitly state the number of raters, their instructions, and qualification criteria; (2) detail the preference aggregation procedure (including any use of majority voting or ranking methods) and report statistical significance testing; (3) clarify baseline selection criteria, including which recent public models were considered and the rationale for inclusion or exclusion; and (4) provide representative test prompt examples along with a description of the prompt curation process. We will also add cross-references from the abstract to these expanded sections. However, we cannot release the complete proprietary test prompt set, as doing so would risk contamination of the benchmark for future models. These revisions will make the reported gains substantially more verifiable from the text while preserving the integrity of our internal evaluation framework.

    revision: partial

standing simulated objections not resolved
  • The full proprietary test prompt set used for internal evaluations will not be released; the simulated authors argue that releasing it would compromise the long-term validity of the benchmark.

Circularity Check

0 steps flagged

No significant circularity in claimed derivation chain

Full rationale

The paper is an empirical report on training large-scale media generation models, with claims of SOTA performance resting on experimental results rather than any mathematical derivation chain. The provided abstract and text contain no equations, self-definitional constructs, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to their own inputs by construction. Technical innovations in architecture, data, and training are presented as independent contributions, and the absence of any quoted reduction (e.g., Eq. X equivalent to input Y) confirms the derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no technical details, equations, or methods are provided, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5880 in / 1032 out tokens · 52423 ms · 2026-05-11T14:10:02.506352+00:00

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching

    cs.LG 2026-05 unverdicted novelty 8.0

    In flow matching, the uncertainty of the clean data given the current state is exactly the divergence of the velocity field (up to a known scalar).

  2. CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

  3. LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

    eess.IV 2026-05 unverdicted novelty 7.0

    LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...

  4. FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

  5. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  6. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  7. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  8. Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

  9. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  10. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.

  11. Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...

  12. Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

    cs.CV 2026-04 conditional novelty 7.0

    SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

  13. Transfer between Modalities with MetaQueries

    cs.CV 2025-04 unverdicted novelty 7.0

    MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

  14. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  15. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  16. FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

    cs.CV 2026-05 unverdicted novelty 6.0

    FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

  17. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  18. Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.

  19. FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.

  20. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  21. NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

    cs.LG 2026-05 unverdicted novelty 6.0

    NoiseRater meta-learns instance-level importance scores for noise in diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.

  22. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  23. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  24. CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

    cs.MM 2026-04 unverdicted novelty 6.0

    CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.

  25. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  26. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  27. Identifying Ethical Biases in Action Recognition Models

    cs.CV 2026-04 unverdicted novelty 6.0

    The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.

  28. VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

    cs.CV 2026-04 unverdicted novelty 6.0

    VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

  29. Why Open Source? A Game-Theoretic Analysis of the AI Race

    cs.GT 2026-04 unverdicted novelty 6.0

    A game-theoretic R&D race model shows that pure Nash equilibria for open-sourcing decisions exist and are computationally tractable in both discrete and continuous settings.

  30. ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

    cs.MM 2026-04 unverdicted novelty 6.0

    ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.

  31. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  32. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  33. INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...

  34. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  35. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  36. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  37. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  38. LTX-Video: Realtime Video Latent Diffusion

    cs.CV 2024-12 conditional novelty 6.0

    LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

  39. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  40. Scaling Properties of Continuous Diffusion Spoken Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

  41. RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.

  42. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  43. Neural Computers

    cs.LG 2026-04 unverdicted novelty 5.0

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...

  44. TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization

    cs.CV 2026-03 unverdicted novelty 5.0

    TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.

  45. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  46. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

  47. Flow Matching Guide and Code

    cs.LG 2024-12 unverdicted novelty 2.0

    Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 47 Pith papers · 28 internal anchors

  1. [1]

    Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation.arXiv preprint arXiv:2304.08477,

    69 Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation.arXiv preprint arXiv:2304.08477,

  2. [2]

    ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324,

  3. [3]

    Lumiere: A space-time diffusion model for video generation,

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945,

  4. [4]

    A note on the Inception Score

    Shane Barratt and Rishi Sharma. A note on the Inception score.arXiv preprint arXiv:1801.01973,

  5. [5]

    Meta Open Compute Project, Grand Teton AI platform.https://engineering

    Jeremy Baumgartner and Matt Bowman. Meta Open Compute Project, Grand Teton AI platform.https://engineering. fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/ ,

  6. [6]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Black Forest Labs. FLUX, 2024.https://blackforestlabs.ai/. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Andreas Blattmann, R...

  7. [7]

    Language Models are Few-Shot Learners

    https://openai.com/research/video-generation-models-as-world-simulators. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.arXiv preprint arXiv:2005.14165,

  8. [8]

    Still-moving: Customized video generation without customized video data.arXiv preprint arXiv:2407.08674,

    Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, and Inbar Mosseri. Still-moving: Customized video generation without customized video data.arXiv preprint arXiv:2407.08674,

  9. [9]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv:2310.19512, 2023a. Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual data...

  10. [10]

    PhotoVerse: Tuning-free image customization with text-to-image diffusion models.arXiv preprint arXiv:2309.05793, 2023b

    Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, et al. PhotoVerse: Tuning-free image customization with text-to-image diffusion models.arXiv preprint arXiv:2309.05793, 2023b. 70 Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fa...

  11. [11]

    Emu: Enhanc- ing image generation models using photogenic needles in a haystack

    Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807,

  12. [12]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588,

  13. [13]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis.arXiv preprint arXiv:2105.05233,

  14. [14]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  15. [15]

    High Fidelity Neural Audio Compression

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438,

  16. [16]

    Understanding back-translation at scale

    Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,

  17. [17]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. arXiv preprint arXiv:2302.03011,

  18. [18]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    FFmpeg Developers. FFmpeg. https://ffmpeg.org/. Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618,

  19. [19]

    TokenFlow: Consistent diffusion features for consistent video editing

    Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373,

  20. [20]

    Text-to-audio generation using instruction-tuned LLM and latent diffusion model

    Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. arXiv preprint arXiv:2304.13731,

  21. [21]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725,

  22. [22]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662,

  23. [23]

    ID-Animator: Zero-shot identity-preserving human video generation

    Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Man Zhou, and Jie Zhang. ID-Animator: Zero-shot identity-preserving human video generation. arXiv preprint arXiv:2404.15275, 2024a. Zecheng He, Bo Sun, Felix Juefei-Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Harihar Subramanyam, Alireza Zareian, Li ...

  24. [24]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and...

  25. [25]

    Direct2v: Large language models are frame-level directors for zero-shot text-to-video generation

    Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, and Seungryong Kim. Direct2v: Large language models are frame-level directors for zero-shot text-to-video generation. arXiv preprint arXiv:2305.14330,

  26. [26]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pre-training for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

  27. [27]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

  28. [28]

    Noise2Music: Text-conditioned music generation with diffusion models

    Qingqing Huang, Daniel S Park, Tao Wang, Timo I Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, et al. Noise2Music: Text-conditioned music generation with diffusion models. arXiv preprint arXiv:2302.03917,

  29. [29]

    Ideogram v2

    Ideogram. Ideogram v2, 2024. https://ideogram.ai/. Jaeyong Kang, Soujanya Poria, and Dorien Herremans. Video2Music: Suitable music generation from videos using an affective multimodal transformer model. arXiv preprint arXiv:2311.00968,

  30. [30]

    Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models

    Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M Rehg, and Pinar Yanardag. Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models. arXiv preprint arXiv:2312.04524,

  31. [31]

    Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439,

  32. [32]

    Fréchet audio distance: A metric for evaluating music enhancement algorithms

    Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466,

  33. [33]

    FIFO-Diffusion: Generating infinite videos from text without training

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. FIFO-Diffusion: Generating infinite videos from text without training. arXiv preprint arXiv:2405.11473,

  34. [34]

    Auto-Encoding Variational Bayes

    Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,

  35. [35]

    Kling AI

    KlingAI. Kling AI, 2024. https://klingai.com/. Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, and Ali Thabet. Imagine Flash: Accelerating Emu diffusion models with backward distillation. arXiv preprint arXiv:2405.05224,

  36. [36]

    VideoPoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125,

  37. [37]

    AudioGen: Textually guided audio generation

    Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352,

  38. [38]

    High fidelity text-guided music generation and editing via single-stage flow matching

    Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, et al. High fidelity text-guided music generation and editing via single-stage flow matching. arXiv preprint arXiv:2407.03648,

  39. [39]

    Voicebox: Text-guided multilingual universal speech generation at scale

    Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. Voicebox: Text-guided multilingual universal speech generation at scale. arXiv preprint arXiv:2306.15687,

  40. [40]

    BigVGAN: A universal neural vocoder with large-scale training

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In CVPR, 2022a. Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658, 2022b. Shenggui Li, Fuzhao Xue, Chaitanya Ba...

  41. [41]

    VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation

    Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023a. Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot...

  42. [42]

    PhotoMaker: Customizing realistic human photos via stacked ID embedding

    Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding. arXiv preprint arXiv:2312.04461, 2023c. Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Péter Vajda, and Diana Marculescu. FlowVid: Tam...

  43. [43]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023a. Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023b....

  44. [44]

    Dream Machine

    LumaLabs. Dream Machine, 2024. https://lumalabs.ai/dream-machine. Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304,

  45. [45]

    Videofusion: Decomposed diffusion models for high-quality video generation

    Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320,

  46. [46]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024a. Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, and Jiashi Feng. Magic-me: Identity-specific video customi...

  47. [47]

    Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

    Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. arXiv preprint arXiv:2404.09956,

  48. [48]

    FoleyGen: Visually-guided audio generation

    Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra. FoleyGen: Visually-guided audio generation. arXiv preprint arXiv:2309.10537,

  49. [49]

    Midjourney

    Midjourney. Midjourney, 2024. https://www.midjourney.com/. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In SC,

  50. [50]

    Dall-E 3

    OpenAI. Dall-E 3, 2024. https://openai.com/index/dall-e-3/. OpenAI. Video generation models as world simulators,

  51. [51]

    MoA: Mixture-of-attention for subject-context disentanglement in personalized image generation

    https://openai.com/index/video-generation-models-as-world-simulators/. Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman, et al. MoA: Mixture-of-attention for subject-context disentanglement in personalized image generation. arXiv preprint arXiv:2404.11565,

  52. [52]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  53. [53]

    InstructVid2Vid: Controllable video editing with natural language instructions

    PySceneDetect Developers. PySceneDetect. https://www.scenedetect.com/. Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, and Yueting Zhuang. InstructVid2Vid: Controllable video editing with natural language instructions. arXiv preprint arXiv:2305.12328,

  54. [54]

    FreeNoise: Tuning-free longer video diffusion via noise rescheduling

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. FreeNoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169,

  55. [55]

    ZeRO: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. arXiv preprint arXiv:1910.02054,

  56. [56]

    SelfEval: Leveraging the discriminative nature of generative models for evaluation

    Sai Saketh Rambhatla and Ishan Misra. SelfEval: Leveraging the discriminative nature of generative models for evaluation. arXiv preprint arXiv:2311.10708,

  57. [57]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125,

  58. [58]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714,

  59. [59]

    ZeRO-Offload: Democratizing billion-scale model training

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training. arXiv preprint arXiv:2101.06840,

  60. [60]

    DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023a. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. HyperDreamBooth: HyperNetworks for...

  61. [61]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202,

  62. [62]

    NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

    Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116,

  63. [63]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

  64. [64]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,

  65. [65]

    UL2: Unifying language learning paradigms

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, et al. UL2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131,

  66. [66]

    Gemini: A Family of Highly Capable Multimodal Models

    Team Gemini. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  67. [67]

    VidMuse: A simple video-to-music generation framework with long-short-term modeling

    Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qi fei Liu, Xu Tan, Qifeng Chen, Wei Xue, and Yi-Ting Guo. VidMuse: A simple video-to-music generation framework with long-short-term modeling. arXiv preprint arXiv:2406.04321,

  68. [68]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and finetuned chat models. arXiv preprint arXiv:2307.09288,

  69. [69]

    Audiobox: Unified audio generation with natural language prompts

    Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts. arXiv preprint arXiv:2312.15821,

  70. [70]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a. Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024a. Tan Wang, Linjie Li, Kevi...

  71. [71]

    LaVie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023d. Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility...

  72. [72]

    Fairy: Fast parallelized instruction-guided video-to-video synthesis

    Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, and Péter Vajda. Fairy: Fast parallelized instruction-guided video-to-video synthesis. arXiv preprint arXiv:2312.13834, 2023a. Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neu...

  73. [73]

    Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation

    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023b. Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuan...

  74. [74]

    DynamiCrafter: Animating open-domain images with video diffusion priors

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pre-training with feature fusion and keyword-to-caption augmentation. In ICASSP, 2023d. Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. DynamiCrafter: Animating open-domain images wit...

  75. [75]

    Demystifying CLIP data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. arXiv preprint arXiv:2309.16671,

  76. [76]

    Video-to-audio generation with hidden alignment

    Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu. Video-to-audio generation with hidden alignment. arXiv preprint arXiv:2407.07464, 2024a. Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. MagicAnimate: Temporally consistent human image animation using diffusion m...

  77. [77]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157,

  78. [78]

    Motion-conditioned image animation for video editing

    Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, and Samaneh Azadi. Motion-conditioned image animation for video editing. arXiv preprint arXiv:2311.18827,

  79. [79]

    Space-time diffusion features for zero-shot text-driven motion transfer

    Danah Yatim, Rafail Fridman, Omer Bar Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. arXiv preprint arXiv:2311.17009,

  80. [80]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

Showing first 80 references.