Recognition: 2 theorem links
· Lean TheoremMovie Gen: A Cast of Media Foundation Models
Pith reviewed 2026-05-11 14:10 UTC · model grok-4.3
The pith
Movie Gen introduces foundation models that generate 1080p videos with synchronized audio and claim state-of-the-art results on text-to-video, personalization, editing, and audio tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that a set of transformer-based foundation models, led by a 30B-parameter model trained on up to 73K video tokens, can produce 1080p videos with varying aspect ratios and synchronized audio, while also supporting image-based personalization, instruction-driven editing, and both video-to-audio and text-to-audio synthesis, all at new levels of performance achieved through targeted changes to architecture, latent representations, training objectives, data curation, and inference methods.
What carries the argument
The 30B-parameter transformer with 73K video token context length that generates 16-second 1080p videos at 16 frames per second while incorporating the listed technical simplifications.
If this is right
- Text-to-video synthesis reaches a new performance level.
- Instruction-based video editing can be performed precisely at high quality.
- Personalized videos can be created from a single user-provided image.
- Video-to-audio and text-to-audio generation also attain leading results.
- Scaling model size, data volume, and compute produces measurable gains in media generation quality.
Where Pith is reading between the lines
- If the scaling improvements hold, similar methods could shorten the time needed to produce custom video content for education or entertainment.
- The approach implies that language-model-style scaling laws may extend to joint video and audio generation.
- Releasing model weights or detailed benchmark code would allow other groups to test extensions such as longer video durations.
Load-bearing premise
The state-of-the-art performance claims depend on internal evaluations and baseline comparisons whose exact protocols, datasets, and human preference methods are not specified in detail.
What would settle it
Independent evaluation of the models on public video generation benchmarks with the same human preference tests used internally would directly confirm or refute the claimed performance gains.
read the original abstract
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Movie Gen, a collection of foundation models for generating high-quality 1080p HD videos with synchronized audio from text, including capabilities for precise instruction-based video editing and personalized video generation from user images. The largest model is a 30B-parameter transformer trained with up to 73K video tokens (corresponding to 16-second videos at 16 fps). The work claims new state-of-the-art results on text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation, while describing technical innovations and simplifications in architecture, latent spaces, training objectives, data curation, evaluation protocols, parallelization, and inference optimizations to support scaling.
Significance. If the reported performance gains hold under transparent, reproducible conditions with appropriate baselines and statistical rigor, the work would constitute a substantial contribution to scalable media generation by demonstrating practical training of very large video transformers and by releasing example outputs to the community.
major comments (1)
- [Abstract] Abstract: The central claims of new state-of-the-art performance on text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation rest on internal evaluations whose test prompts, baseline selection criteria (including whether recent public models were included), rater count and instructions, preference aggregation method, and significance testing are not specified. This renders the empirical superiority assertions unverifiable from the provided text.
Simulated Author's Rebuttal
We thank the referee for their detailed review and for emphasizing the need for greater transparency in our evaluation methodology. We address the major comment below and commit to revisions that improve verifiability while respecting the practical constraints of large-scale internal evaluations.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claims of new state-of-the-art performance on text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation rest on internal evaluations whose test prompts, baseline selection criteria (including whether recent public models were included), rater count and instructions, preference aggregation method, and significance testing are not specified. This renders the empirical superiority assertions unverifiable from the provided text.
Authors: We appreciate the referee's observation that the abstract's SOTA claims would be strengthened by more explicit details on our evaluation protocols. While the full manuscript contains a dedicated evaluation section describing our human preference studies, we agree that additional specificity is warranted. In the revised manuscript we will: (1) explicitly state the number of raters, their instructions, and qualification criteria; (2) detail the preference aggregation procedure (including any use of majority voting or ranking methods) and report statistical significance testing; (3) clarify baseline selection criteria, including which recent public models were considered and the rationale for inclusion or exclusion; and (4) provide representative test prompt examples along with a description of the prompt curation process. We will also add cross-references from the abstract to these expanded sections. However, we cannot release the complete proprietary test prompt set, as doing so would risk contamination of the benchmark for future models. These revisions will make the reported gains substantially more verifiable from the text while preserving the integrity of our internal evaluation framework. revision: partial
- Release of the full proprietary test prompt set used for internal evaluations, as this would compromise the long-term validity of the benchmark.
Circularity Check
No significant circularity in claimed derivation chain
full rationale
The paper is an empirical report on training large-scale media generation models, with claims of SOTA performance resting on experimental results rather than any mathematical derivation chain. The provided abstract and text contain no equations, self-definitional constructs, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claims to their own inputs by construction. Technical innovations in architecture, data, and training are presented as independent contributions, and the absence of any quoted reduction (e.g., Eq. X equivalent to input Y) confirms the derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 47 Pith papers
-
Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching
In flow matching, the uncertainty of the clean data given the current state is exactly the divergence of the velocity field (up to a known scalar).
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
Speculative Decoding for Autoregressive Video Generation
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...
-
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.
-
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
Transfer between Modalities with MetaQueries
MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
-
Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.
-
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
NoiseRater meta-learns instance-level importance scores for noise in diffusion training via bilevel optimization, then uses a two-stage pipeline to improve efficiency and generation quality on FFHQ and ImageNet.
-
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
-
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
-
CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration
CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
Identifying Ethical Biases in Action Recognition Models
The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
Why Open Source? A Game-Theoretic Analysis of the AI Race
A game-theoretic R&D race model shows that pure Nash equilibria for open-sourcing decisions exist and are computationally tractable in both discrete and continuous settings.
-
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
ControlFoley introduces a unified framework for controllable video-to-audio generation using joint visual encoding, temporal-timbre decoupling, and robust multimodal training to handle cross-modal conflicts.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
-
LTX-Video: Realtime Video Latent Diffusion
LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
Scaling Properties of Continuous Diffusion Spoken Language Models
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
-
RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation
RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.
-
Motif-Video 2B: Technical Report
Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
-
Neural Computers
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...
-
TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization
TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
Flow Matching Guide and Code
Flow Matching is a generative modeling framework with mathematical foundations, design choices, extensions, and open-source PyTorch code for applications like image and text generation.
Reference graph
Works this paper leans on
-
[1]
69 Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, and Xi Yin. Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation.arXiv preprint arXiv:2304.08477,
-
[2]
ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers.arXiv preprint arXiv:2211.01324,
-
[3]
Lumiere: A space-time diffusion model for video generation,
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945,
-
[4]
Shane Barratt and Rishi Sharma. A note on the Inception score.arXiv preprint arXiv:1801.01973,
-
[5]
Meta Open Compute Project, Grand Teton AI platform.https://engineering
Jeremy Baumgartner and Matt Bowman. Meta Open Compute Project, Grand Teton AI platform.https://engineering. fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/ ,
work page 2022
-
[6]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Black Forest Labs. FLUX, 2024.https://blackforestlabs.ai/. Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Andreas Blattmann, R...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Language Models are Few-Shot Learners
https://openai.com/research/video-generation-models-as-world-simulators. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.arXiv preprint arXiv:2005.14165,
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[8]
Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, and Inbar Mosseri. Still-moving: Customized video generation without customized video data.arXiv preprint arXiv:2407.08674,
-
[9]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv:2310.19512, 2023a. Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual data...
work page internal anchor Pith review arXiv
-
[10]
Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, et al. PhotoVerse: Tuning-free image customization with text-to-image diffusion models.arXiv preprint arXiv:2309.05793, 2023b. 70 Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fa...
-
[11]
Emu: Enhanc- ing image generation models using photogenic needles in a haystack
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807,
-
[12]
Vision Transformers Need Registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588,
work page internal anchor Pith review arXiv
-
[13]
Diffusion Models Beat GANs on Image Synthesis
Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis.arXiv preprint arXiv:2105.05233,
work page internal anchor Pith review arXiv
-
[14]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv preprint arXiv:2210.13438,
work page internal anchor Pith review arXiv
-
[16]
Understanding back-translation at scale
Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale.arXiv preprint arXiv:1808.09381,
-
[17]
Structure and content-guided video synthesis with diffusion models
71 Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models.arXiv preprint arXiv:2302.03011,
-
[18]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
FFmpeg Developers. FFmpeg.https://ffmpeg.org/. Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618,
work page internal anchor Pith review arXiv
-
[19]
Tokenflow: Con- sistent diffusion features for consistent video editing,
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373,
-
[20]
Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction-tuned LLM and latent diffusion model.arXiv preprint arXiv:2304.13731,
-
[21]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Photorealistic video generation with diffusion models,
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models.arXiv preprint arXiv:2312.06662,
-
[23]
Id-animator: Zero-shot identity- preserving human video generation,
Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao Hu, Ke Cao, Keyu Yan, Man Zhou, and Jie Zhang. ID-Animator: Zero-shot identity-preserving human video generation.arXiv preprint arXiv:2404.15275, 2024a. Zecheng He, Bo Sun, Felix Juefei-Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Harihar Subramanyam, Alireza Zareian, Li ...
-
[24]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and...
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
72 Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, and Seungryong Kim. Direct2v: Large language models are frame-level directors for zero-shot text-to-video generation.arXiv preprint arXiv:2305.14330,
-
[26]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pre-training for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Qingqing Huang, Daniel S Park, Tao Wang, Timo I Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, et al. Noise2Music: Text-conditioned music generation with diffusion models.arXiv preprint arXiv:2302.03917,
-
[29]
Ideogram v2, 2024.https://ideogram.ai/
Ideogram. Ideogram v2, 2024.https://ideogram.ai/. Jaeyong Kang, Soujanya Poria, and Dorien Herremans. Video2Music: Suitable music generation from videos using an affective multimodal transformer model.arXiv preprint arXiv:2311.00968,
-
[30]
Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M Rehg, and Pinar Yanardag. Rave: Randomized noise shuffling for fast and consistent video editing with diffusion models.arXiv preprint arXiv:2312.04524,
-
[31]
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2Video-Zero: Text-to-image diffusion models are zero-shot video gen- erators. arXiv preprint arXiv:2303.13439,
-
[32]
Fr\’echet audio distance: A metric for evaluating music enhancement algo- rithms,
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A metric for evaluating music enhancement algorithms.arXiv preprint arXiv:1812.08466,
-
[33]
FIFO-Diffusion: Generating infinite videos from text without training
Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. FIFO-Diffusion: Generating infinite videos from text without training. arXiv preprint arXiv:2405.11473,
-
[34]
Auto-Encoding Variational Bayes
Diederik P Kingma. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Kling AI, 2024.https://klingai.com/
KlingAI. Kling AI, 2024.https://klingai.com/. Jonas Kohler, Albert Pumarola, Edgar Schönfeld, Artsiom Sanakoyeu, Roshan Sumbaly, Peter Vajda, and Ali Thabet. Imagine Flash: Accelerating Emu diffusion models with backward distillation.arXiv preprint arXiv:2405.05224,
-
[36]
Videopoet: A large language model for zero-shot video generation.arXiv:2312.14125,
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. VideoPoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125,
-
[37]
Audiogen: Textually guided audio generation,
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. AudioGen: Textually guided audio generation.arXiv preprint arXiv:2209.15352,
-
[38]
High fidelity text-guided music generation and editing via single-stage flow matching
Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, et al. High fidelity text-guided music generation and editing via single-stage flow matching. arXiv preprint arXiv:2407.03648,
-
[39]
Voicebox: Text-guided multilingual universal speech generation at scale
73 Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. Voicebox: Text-guided multilingual universal speech generation at scale. arXiv preprint arXiv:2306.15687,
-
[40]
Bigvgan: A universal neural vocoder with large-scale training
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. InCVPR, 2022a. Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. BigVGAN: A universal neural vocoder with large-scale training.arXiv preprint arXiv:2206.04658, 2022b. Shenggui Li, Fuzhao Xue, Chaitanya Ba...
-
[41]
Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong Wang. VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation. arXiv preprint arXiv:2309.00398, 2023a. Xirui Li, Chao Ma, Xiaokang Yang, and Ming-Hsuan Yang. VidToMe: Video token merging for zero-shot...
-
[42]
Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. PhotoMaker: Customizing realistic human photos via stacked ID embedding.arXiv preprint arXiv:2312.04461, 2023c. Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Péter Vajda, and Diana Marculescu. FlowVid: Tam...
-
[43]
Ring Attention with Blockwise Transformers for Near-Infinite Context
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023a. Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models.arXiv preprint arXiv:2301.12503, 2023b....
work page internal anchor Pith review arXiv
-
[44]
Dream Machine, 2024.https://lumalabs.ai/dream-machine
LumaLabs. Dream Machine, 2024.https://lumalabs.ai/dream-machine. Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304,
work page 2024
-
[45]
Videofusion: Decomposed diffusion models for high-quality video generation
Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation.arXiv preprint arXiv:2303.08320,
-
[46]
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024a. 74 Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, and Jiashi Feng. Magic-me: Identity-specific video customi...
work page internal anchor Pith review arXiv
-
[47]
Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization
Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. arXiv preprint arXiv:2404.09956,
-
[48]
FoleyGen: Visually-guided audio generation.arXiv preprint arXiv:2309.10537,
Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra. FoleyGen: Visually-guided audio generation.arXiv preprint arXiv:2309.10537,
-
[49]
Midjourney, 2024.https://www.midjourney.com/
Midjourney. Midjourney, 2024.https://www.midjourney.com/. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. InSC,
work page 2024
-
[50]
Dall-E 3, 2024.https://openai.com/index/dall-e-3/
OpenAI. Dall-E 3, 2024.https://openai.com/index/dall-e-3/. OpenAI. Video generation models as world simulators,
work page 2024
-
[51]
Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman, et al
https://openai.com/index/ video-generation-models-as-world-simulators/. Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman, et al. MoA: Mixture-of-attention for subject-context disentanglement in personalized image generation.arXiv preprint arXiv:2404.11565,
-
[52]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rom- bach. SDXL: improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952,
work page internal anchor Pith review Pith/arXiv arXiv
-
[53]
Instructvid2vid: Controllable video editing with natural language instructions,
PySceneDetect Developers. PySceneDetect.https://www.scenedetect.com/. Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, and Yueting Zhuang. InstructVid2Vid: Controllable video editing with natural language instructions.arXiv preprint arXiv:2305.12328,
-
[54]
Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. FreeNoise: Tuning-free longer video diffusion via noise rescheduling.arXiv preprint arXiv:2310.15169,
-
[55]
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models.arXiv preprint arXiv:1910.02054,
-
[56]
SelfEval: Leveraging the discriminative nature of generative models for evaluation
Sai Saketh Rambhatla and Ishan Misra. SelfEval: Leveraging the discriminative nature of generative models for evaluation. arXiv preprint arXiv:2311.10708,
-
[57]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125,
work page internal anchor Pith review Pith/arXiv arXiv
-
[58]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714,
work page internal anchor Pith review Pith/arXiv arXiv
-
[59]
arXiv:2101.06840 [cs.DC]https://arxiv.org/abs/2101.06840
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training.arXiv preprint arXiv:2101.06840,
-
[60]
DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. InCVPR, 2023a. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. HyperDreamBooth: HyperNetworks for...
-
[61]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve transformer.arXiv preprint arXiv:2002.05202,
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[62]
Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers.arXiv preprint arXiv:2304.09116,
-
[63]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- LM: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053,
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[64]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[65]
Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, et al. UL2: Unifying language learning paradigms.arXiv preprint arXiv:2205.05131,
-
[66]
Gemini: A Family of Highly Capable Multimodal Models
Team Gemini. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[67]
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qi fei Liu, Xu Tan, Qifeng Chen, Wei Xue, and Yi-Ting Guo. VidMuse: A simple video-to-music generation framework with long-short-term modeling.arXiv preprint arXiv:2406.04321,
-
[68]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and finetuned chat models. arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Audiobox: Unified audio generation with natural language prompts.arXiv preprint arXiv:2312.15821,
Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, et al. Audiobox: Unified audio generation with natural language prompts.arXiv preprint arXiv:2312.15821,
-
[70]
ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a. Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. InstantID: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024a. Tan Wang, Linjie Li, Kevi...
work page internal anchor Pith review arXiv
-
[71]
Lavie: High-quality video gener- ation with cascaded latent diffusion models
Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. LaVie: High-quality video generation with cascaded latent diffusion models.arXiv preprint arXiv:2309.15103, 2023d. 77 Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility...
-
[72]
Fairy: Fast parallelized instruction-guided video-to-video synthesis,
Bichen Wu, Ching-Yao Chuang, Xiaoyan Wang, Yichen Jia, Kapil Krishnakumar, Tong Xiao, Feng Liang, Licheng Yu, andPéterVajda. Fairy: Fastparallelizedinstruction-guidedvideo-to-videosynthesis. arXiv preprint arXiv:2312.13834, 2023a. Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neu...
-
[73]
Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023b. Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuan...
-
[74]
Dynamicrafter: Animating open-domain images with video diffusion priors,
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pre-training with feature fusion and keyword-to-caption augmentation. InICASSP, 2023d. Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. DynamiCrafter: Animating open-domain images wit...
-
[75]
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data.arXiv preprint arXiv:2309.16671,
-
[76]
Video-to-audio generation with hidden alignment.arXiv preprint arXiv:2407.07464, 2024a
Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu. Video-to-audio generation with hidden alignment.arXiv preprint arXiv:2407.07464, 2024a. Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. MagicAnimate: Temporally consistent human image animation using diffusion m...
-
[77]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157,
work page internal anchor Pith review arXiv
-
[78]
Motion-conditioned image animation for video editing.arXiv preprint arXiv:2311.18827,
Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, and Samaneh Azadi. Motion-conditioned image animation for video editing.arXiv preprint arXiv:2311.18827,
-
[79]
Danah Yatim, Rafail Fridman, Omer Bar Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer.arXiv preprint arXiv:2311.17009,
-
[80]
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721,
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.