Recognition: 3 theorem links · Lean Theorem
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Pith reviewed 2026-05-10 22:53 UTC · model grok-4.3
The pith
Three training stages on a curated large dataset turn latent diffusion models into competitive video generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. We identify three stages for successful training: text-to-image pretraining, video pretraining on a well-curated dataset, and high-quality video finetuning. A systematic curation process of captioning and filtering is required to produce high-quality videos. The resulting base model is competitive with closed-source text-to-video systems, provides a strong motion representation for image-to-video tasks, and supplies a multi-view 3D prior that allows finetuning into a feedforward multi-view diffusion model outperforming image-based methods at a fraction of their compute budget.
What carries the argument
The three-stage training pipeline of text-to-image pretraining, video pretraining on a curated dataset, and high-quality video finetuning, together with the dataset curation process of captioning and filtering.
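A compressed sketch of the staged recipe as it reads here, purely to fix the ordering of the stages; the helper names and step budgets are placeholders, not the paper's actual schedule.

```python
# Placeholder staged-training driver; only the stage ordering comes from the
# review text above, the budgets and helper names are assumptions.
def train(model, dataset, steps):
    """Standard latent-diffusion training loop (omitted in this sketch)."""
    return model

def curate(videos, caption_fn, filters):
    """Caption every clip, then keep only clips passing all filters."""
    captioned = [(v, caption_fn(v)) for v in videos]
    return [(v, c) for v, c in captioned if all(f(v, c) for f in filters)]

def staged_pipeline(model, image_text, raw_videos, hq_videos, caption_fn, filters):
    model = train(model, image_text, steps=1_000_000)    # stage 1: text-to-image pretraining
    curated = curate(raw_videos, caption_fn, filters)
    model = train(model, curated, steps=150_000)         # stage 2: curated video pretraining
    model = train(model, hq_videos, steps=50_000)        # stage 3: high-quality finetuning
    return model
```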
If this is right
- The base model adapts to image-to-video generation and to camera-motion control through low-rank adaptation (LoRA) modules (a minimal sketch follows this list).
- It can be further finetuned into a multi-view diffusion model that jointly generates multiple object views in one forward pass.
- The overall training strategy produces video generation quality that matches closed-source systems.
- Releasing the trained weights and code makes the approach available for community use on related video tasks.
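To make the low-rank adaptation idea in the first bullet concrete, here is a minimal PyTorch sketch of wrapping a frozen projection with a trainable low-rank update. The class name, rank, and scaling are illustrative assumptions, not the released camera-motion LoRA configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual (illustrative).

    Effective weight is W + (alpha / r) * B @ A, where only A and B are
    trained. Defaults here are assumptions, not the paper's settings.
    """
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # keep pretrained weight frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the low-rank residual path.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

# Usage sketch: wrap attention projections in a video U-Net block, then
# finetune only the LoRA parameters on camera-motion-specific clips.
proj = nn.Linear(320, 320)
adapted = LoRALinear(proj, rank=4)
out = adapted(torch.randn(2, 16, 320))                    # (batch, tokens, channels)
```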
Where Pith is reading between the lines
- The same staged curation and training pattern may transfer to scaling diffusion models for other sequential data such as audio or 3D motion sequences.
- The learned multi-view prior could support downstream tasks like video depth estimation or novel-view synthesis with little extra supervision.
- Widespread use of this open recipe may reduce dependence on proprietary video datasets in the broader field of generative modeling.
Load-bearing premise
That the three identified training stages plus systematic captioning and filtering of the pretraining data are necessary and sufficient for high-quality video output.
What would settle it
Training an otherwise identical model on the same videos but without the described captioning and filtering steps, or with the stages reordered or collapsed, and measuring whether text-to-video quality remains competitive.
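As a rough illustration of how such a settling experiment could be scored, the sketch below computes a Fréchet-style distance (the statistic behind FVD) between feature sets for each training variant. Feature extraction with a video network such as I3D is assumed to happen upstream and is stubbed with random arrays here; the variant names are hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_*: (num_videos, feature_dim) arrays, e.g. I3D embeddings of real
    clips vs. clips sampled from one training variant.
    """
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                            # drop numerical imaginary parts
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean))

# Hypothetical ablation: same architecture and step budget, different data.
real_feats = np.random.randn(256, 400)                    # stand-in for real-clip features
for name in ["curated", "uncurated_same_size", "stages_collapsed"]:
    fake_feats = np.random.randn(256, 400)                # stand-in for generated-clip features
    print(name, "FVD-style score:", round(frechet_distance(real_feats, fake_feats), 1))
```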
read the original abstract
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Stable Video Diffusion (SVD), a latent video diffusion model for high-resolution text-to-video and image-to-video generation. It identifies three training stages—text-to-image pretraining, video pretraining on a systematically curated large dataset (with captioning and filtering), and high-quality finetuning—and claims this pipeline produces models competitive with closed-source systems. The work further shows the base model provides strong motion priors for image-to-video tasks and camera-motion LoRAs, and serves as an effective starting point for finetuning a multi-view diffusion model that jointly generates multiple object views in a feedforward manner, outperforming image-based methods at lower compute. Code and weights are released publicly.
Significance. If the results hold, the work is significant as one of the first detailed, large-scale open efforts to scale latent video diffusion models, providing a reproducible recipe for data curation and staged training that addresses the field's lack of consensus on video data strategies. The public release of code/weights and the demonstration of the model's utility as a 3D prior for multi-view generation (at a fraction of prior compute) are clear strengths that can accelerate downstream research in generative video and 3D vision.
major comments (2)
- [§3 and §4] §3 (Training stages) and §4 (Experiments): The central claim that the three-stage pipeline plus systematic curation (captioning + filtering) is necessary for competitive performance lacks load-bearing ablations. No quantitative comparisons (FVD, CLIP similarity, or human preference) are reported for an otherwise identical model trained on uncurated or randomly subsampled data of equal size, or for variants that skip one stage (e.g., video pretraining without image pretraining). This makes it impossible to isolate whether headline results are driven by the claimed pipeline versus model scale and the base image LDM.
- [§4.3] §4.3 (Multi-view generation): The claim that the finetuned multi-view model outperforms image-based methods at a fraction of compute is load-bearing for the 3D-prior contribution, yet the section provides no exact baseline details (number of views, resolution, or total FLOPs) or error bars on the reported metrics, preventing verification of the efficiency advantage.
minor comments (2)
- [Figure 2] Figure 2 and associated text: Qualitative video examples would benefit from explicit mention of sampling parameters (guidance scale, number of frames, inference steps) to allow reproduction.
- [§2] Notation in §2 (Preliminaries): The temporal layer insertion into the U-Net is described at a high level; adding a short equation or diagram for the 3D convolution / attention modification would improve clarity.
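In the spirit of the second minor comment, below is a minimal PyTorch sketch of one common way to insert a temporal layer into a 2D U-Net block: frames processed independently by spatial layers are reshaped so attention mixes the time axis at each spatial position, with a zero-initialized gate so the block starts as an identity. Shapes and initialization follow the general video-LDM pattern rather than the exact released architecture.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal mixing layer added after a spatial U-Net block."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))          # zero gate: starts as identity

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width), as produced by 2D layers.
        bt, c, h, w = x.shape
        b = bt // num_frames
        seq = x.permute(0, 2, 3, 1).reshape(b, num_frames, h * w, c)
        seq = seq.permute(0, 2, 1, 3).reshape(b * h * w, num_frames, c)
        q = self.norm(seq)
        attn_out, _ = self.attn(q, q, q)                  # attention along the time axis
        seq = seq + torch.tanh(self.gate) * attn_out
        seq = seq.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return seq.reshape(bt, c, h, w)

x = torch.randn(2 * 8, 64, 16, 16)                        # 2 videos, 8 frames each
out = TemporalAttention(64)(x, num_frames=8)              # same shape, temporally mixed
```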
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our work. We appreciate the emphasis on strengthening the experimental claims through additional ablations and details. Below we respond point-by-point to the major comments, indicating where revisions will be made.
read point-by-point responses
-
Referee: [§3 and §4] §3 (Training stages) and §4 (Experiments): The central claim that the three-stage pipeline plus systematic curation (captioning + filtering) is necessary for competitive performance lacks load-bearing ablations. No quantitative comparisons (FVD, CLIP similarity, or human preference) are reported for an otherwise identical model trained on uncurated or randomly subsampled data of equal size, or for variants that skip one stage (e.g., video pretraining without image pretraining). This makes it impossible to isolate whether headline results are driven by the claimed pipeline versus model scale and the base image LDM.
Authors: We acknowledge that direct ablations comparing the full pipeline against uncurated data of equal size or ablated stages would provide stronger isolation of each component's contribution. However, each full-scale training run on our dataset size requires substantial compute resources that were not available for multiple parallel experiments. The staged approach builds directly on established practices from large-scale image latent diffusion models, where text-to-image pretraining has been shown to be critical for high-quality generation. Our results demonstrate that the complete pipeline yields competitive performance with closed-source systems, and the public release of code and weights enables the community to conduct further controlled ablations. In the revision we will add an explicit discussion of this limitation and the practical constraints that prevented exhaustive ablations. revision: partial
-
Referee: [§4.3] §4.3 (Multi-view generation): The claim that the finetuned multi-view model outperforms image-based methods at a fraction of compute is load-bearing for the 3D-prior contribution, yet the section provides no exact baseline details (number of views, resolution, or total FLOPs) or error bars on the reported metrics, preventing verification of the efficiency advantage.
Authors: We agree that precise baseline specifications and uncertainty estimates are necessary to substantiate the efficiency claims. In the revised manuscript we will expand §4.3 to include the exact number of views, output resolution, and estimated total FLOPs for both our multi-view model and the compared image-based methods. We will also report standard deviations or error bars on the quantitative metrics where multiple runs or samples permit. revision: yes
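A hedged sketch of how the promised compute estimates could be produced: count multiply-accumulates for one denoiser forward pass with forward hooks, then scale by denoising steps and by the number of views generated jointly versus separately. The toy model, step count, and view count below are placeholders, not figures from the paper.

```python
import torch
import torch.nn as nn

def count_macs(model: nn.Module, example_input: torch.Tensor) -> int:
    """Rough multiply-accumulate count for one forward pass (illustrative).

    Only Conv2d and Linear layers are counted; attention matmuls and other
    ops are ignored, so this is a lower bound for relative comparisons.
    """
    macs = 0
    hooks = []

    def conv_hook(module, inputs, output):
        nonlocal macs
        k = module.kernel_size[0] * module.kernel_size[1]
        macs += output.numel() * module.in_channels * k // module.groups

    def linear_hook(module, inputs, output):
        nonlocal macs
        macs += output.numel() * module.in_features

    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            hooks.append(m.register_forward_hook(conv_hook))
        elif isinstance(m, nn.Linear):
            hooks.append(m.register_forward_hook(linear_hook))
    with torch.no_grad():
        model(example_input)
    for h in hooks:
        h.remove()
    return macs

# Hypothetical comparison: per-sample cost ~ MACs * denoising steps * passes.
toy_denoiser = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.Conv2d(64, 4, 3, padding=1))
macs = count_macs(toy_denoiser, torch.randn(1, 4, 64, 64))
print("joint multi-view pass :", macs * 50)               # 50 steps, all views at once
print("per-view generation   :", macs * 50 * 8)           # 8 separate image-based passes
```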
Circularity Check
No circularity: empirical training pipeline with external benchmarks
full rationale
The paper reports results from training a latent video diffusion model via three sequential stages (text-to-image pretraining, video pretraining on curated data, high-quality finetuning) and evaluates performance on standard external metrics and tasks such as text-to-video generation, image-to-video, and multi-view synthesis. No equations, fitted parameters, or predictions are presented as independent derivations; claims rest on experimental outcomes compared to closed-source baselines and prior image-based methods. Self-citations to the authors' earlier Stable Diffusion work describe the base architecture but do not form a load-bearing circular chain, as the video-specific contributions (curation process, stage ordering, LoRA adaptations) are validated through new training runs and downstream evaluations rather than reducing to definitions or prior self-referential results.
Axiom & Free-Parameter Ledger
free parameters (2)
- training hyperparameters across stages
- data filtering thresholds
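To show what "data filtering thresholds" means in practice, here is a minimal sketch of a threshold-based curation filter over precomputed per-clip statistics (motion from optical flow, caption-frame CLIP similarity, aesthetics, detected text coverage). All field names and cutoff values are illustrative assumptions, not the settings used for the released model.

```python
from dataclasses import dataclass

@dataclass
class ClipStats:
    """Per-clip annotations assumed to be precomputed upstream."""
    mean_flow: float        # average optical-flow magnitude (motion proxy)
    clip_sim: float         # caption-to-frame CLIP similarity
    aesthetic: float        # aesthetic score of a sampled frame
    text_area: float        # fraction of frame area covered by detected text

def keep_clip(s: ClipStats,
              min_flow: float = 1.5,
              min_clip_sim: float = 0.25,
              min_aesthetic: float = 4.5,
              max_text_area: float = 0.10) -> bool:
    """Threshold-based curation filter; all cutoffs are placeholder values."""
    return (s.mean_flow >= min_flow            # drop near-static clips
            and s.clip_sim >= min_clip_sim     # drop caption/content mismatches
            and s.aesthetic >= min_aesthetic   # drop low-quality frames
            and s.text_area <= max_text_area)  # drop text-heavy clips

sample = ClipStats(mean_flow=3.2, clip_sim=0.31, aesthetic=5.1, text_area=0.02)
print(keep_clip(sample))  # True under these placeholder thresholds
```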
axioms (2)
- domain assumption: Latent diffusion models trained on images can be extended to video by inserting temporal layers and continuing training.
- domain assumption: Well-curated, large-scale video data improves generative quality over smaller or unfiltered sets.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression
OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.
-
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
Single-Shot HDR Recovery via a Video Diffusion Prior
Single-shot HDR is achieved by conditioning a video diffusion model on an LDR input to generate an exposure bracket and fusing the bracket with per-pixel weights from a lightweight UNet.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
-
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Eulerian adjacent-frame motion guidance plus bidirectional geometric consistency improves training speed, temporal coherence, and artifact reduction in diffusion-based image animation.
-
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Eulerian adjacent-frame motion fields with bidirectional cycle consistency checks enable faster parallel training and fewer artifacts in diffusion model image animation compared to initial-frame Lagrangian guidance.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes
S2C-3D reconstructs complete high-fidelity 3D scenes from as few as 6-8 images by finetuning a diffusion model on scene data, applying consistency-conditioned sampling, and planning trajectories for full coverage.
-
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...
-
Generative Modeling with Orbit-Space Particle Flow Matching
OGPP is a particle flow-matching method using orbit-space canonicalization and geometric paths that achieves lower error and fewer steps than prior approaches on 3D benchmarks.
-
TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
-
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
-
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions
TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, ...
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
DreamStereo uses GAPW, PBDP, and SASI to enable real-time stereo video inpainting at 25 FPS for HD videos by reducing over 70% redundant computation while maintaining quality.
-
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
Tracking High-order Evolutions via Cascading Low-rank Fitting
Cascading low-rank fitting approximates successive high-order derivatives in diffusion models via a shared base function with sequentially added low-rank components, accompanied by theorems proving monotonic non-incre...
-
Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...
-
Envisioning the Future, One Step at a Time
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
-
RewardFlow: Generate Images by Optimizing What You Reward
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
-
UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining
UENR-600K is a 600,000-frame synthetic dataset for nighttime video deraining that uses 3D rain particle simulation in Unreal Engine to enable better generalization to real scenes.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow
GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...
-
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
-
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
-
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
-
Quantitative Video World Model Evaluation for Geometric-Consistency
PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.
-
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors
VidSplat iteratively synthesizes novel views with geometry-guided video diffusion to enable robust Gaussian splatting reconstruction from sparse or single-image inputs.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.