Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Pith reviewed 2026-05-13 13:36 UTC · model grok-4.3
The pith
Sora generates realistic videos from text by simulating physical world scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that Sora operates as a world simulator: a model trained to produce videos of realistic or imaginative scenes from text instructions, built on large vision models whose details must be inferred from publicly available reports. Its projected impact across industries is tempered by the need for better safety, bias control, and technical refinement in future video generation systems.
What carries the argument
The world simulator, a generative system that produces video sequences consistent with physical dynamics from textual input.
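The review infers that this simulator rests on a diffusion transformer operating over "spacetime patches" of video. As a rough, hypothetical sketch (the shapes, patch sizes, and function name here are illustrative assumptions, not Sora's actual design), ViT-style patchification extended along the time axis looks like:

```python
# Illustrative sketch only: turning a video tensor into "spacetime patches",
# the inferred input unit for a diffusion transformer. All shapes are dummies.
import numpy as np

def patchify(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Split a (T, H, W, C) video into a sequence of flattened spacetime
    patches of size pt x ph x pw, ViT-style but extended along time."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # bring the patch grid axes first
    return x.reshape(-1, pt * ph * pw * C)     # one row per spacetime patch

video = np.zeros((16, 32, 32, 3))              # tiny dummy clip
tokens = patchify(video, pt=4, ph=8, pw=8)
print(tokens.shape)                            # (64, 768)
```

Each row then plays the role a word token plays in a language model, which is what lets transformer scaling carry over to video.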
If this is right
- Sora enables automated generation of video scenes for film production without traditional filming.
- It supplies customizable visual explanations and simulations for educational content.
- Marketing can produce tailored video assets from simple text descriptions at scale.
- Widespread use depends on solving issues of bias, safety, and content control in outputs.
- Further progress opens paths to more interactive and productive forms of human-AI video collaboration.
Where Pith is reading between the lines
- The approach could extend to interactive environments where users refine generated videos in real time.
- It connects to broader efforts in building AI systems that respect physical laws for scientific visualization.
- Generated content raises questions about ownership and verification that may require new standards.
- Overcoming current length and consistency limits would test whether the simulator scales to complex multi-shot narratives.
Load-bearing premise
Public technical reports and reverse engineering efforts accurately reflect the model's true internal architecture and training details.
What would settle it
Official technical documentation from the developers that either confirms or contradicts the architecture and capabilities inferred from public sources.
Original abstract
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and show potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.
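The "underlying technologies" the abstract points to center on diffusion-based generation. A minimal, illustrative sketch of the epsilon-prediction training objective that family uses (the linear noise schedule and stand-in model below are assumptions for demonstration, not any model's actual configuration):

```python
# Minimal sketch of a DDPM-style training step: corrupt clean data x0 with
# noise at a random timestep t, then score the model on predicting that noise.
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal retention

def diffusion_loss(x0: np.ndarray, model) -> float:
    """One training step's loss: MSE between true and predicted noise."""
    t = int(rng.integers(len(betas)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((model(x_t, t) - eps) ** 2))

x0 = rng.standard_normal((8, 8))                          # dummy "clean" data
loss = diffusion_loss(x0, lambda x, t: np.zeros_like(x))  # stand-in model
print(loss)
```

A real text-to-video system conditions the noise predictor on text embeddings and runs it over latent video patches, but the objective is this same denoising regression.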
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a review of OpenAI's Sora text-to-video generative model released in February 2024. It traces the model's background and development, examines underlying technologies inferred from public technical reports and reverse-engineering efforts, surveys applications across film-making, education, and marketing, discusses challenges such as ensuring safe and unbiased generation, and outlines future directions for text-to-video models to support new forms of human-AI interaction.
Significance. If the synthesis of publicly available information holds, the paper offers a timely consolidation of knowledge on a leading large vision model in the text-to-video domain. This could serve as a useful reference for the computer vision community by identifying key limitations and opportunities, thereby guiding subsequent research on generative models without relying on proprietary claims.
Minor comments (3)
- The abstract and introduction would benefit from explicit statements on the scope of reverse-engineering claims to avoid any implication of direct access to proprietary details (e.g., clarify in the first paragraph of the introduction that all technical descriptions derive solely from public sources).
- Section headings and transitions between technology descriptions and applications could be tightened for better flow; some paragraphs repeat background information already covered earlier.
- Add a short table or bullet list summarizing the main public reports cited for each technological component (e.g., diffusion models, transformer variants) to improve traceability and reader navigation.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our manuscript as a timely review synthesizing public information on Sora. We appreciate the recommendation for minor revision and the recognition of its potential value to the computer vision community. Since the report does not list any specific major comments, we interpret the minor revision request as an opportunity to polish the presentation and ensure all claims remain grounded in publicly available sources.
Point-by-point responses
-
Referee: No specific major comments were provided in the report; the review is generally supportive with a minor revision recommendation.
Authors: We will perform a careful pass to address minor issues of clarity, citation, and formatting in the revised version. The core synthesis of background, technology inferences, applications, limitations, and opportunities will remain unchanged, as it is grounded in public technical reports and reverse-engineering efforts.
Circularity Check
No significant circularity
Full rationale
The paper is a literature review that explicitly grounds all claims in external public technical reports and reverse-engineering efforts rather than any internal derivations, fitted parameters, or self-referential predictions. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear; the central synthesis draws from outside sources and remains independent of its own inputs.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 26 Pith papers
-
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
-
MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation
MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.
-
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...
-
A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...
-
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
-
Controllable Generative Video Compression
CGVC uses coded keyframes and per-frame priors to guide controllable generative reconstruction of video frames, outperforming prior perceptual compression methods in both signal fidelity and perceptual quality.
-
DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized metho...
-
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.
-
DiffATS: Diffusion in Aligned Tensor Space
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...
-
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
-
Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
-
DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection
DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics
StoryBlender generates inter-shot consistent editable 3D storyboards using a three-stage pipeline of semantic-spatial grounding, canonical asset materialization, and spatial-temporal dynamics with agent-based verification.
-
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
MotionGRPO applies GRPO with noise injection and hybrid rewards to diffusion-based egocentric motion recovery, overcoming vanishing gradients from low intra-group diversity to reach state-of-the-art performance.
-
Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?
Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
The Amazing Stability of Flow Matching
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
-
Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
-
Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity
The paper surveys the evolution of video trailer generation from extractive heuristics to generative AI methods and proposes a new taxonomy for future systems based on autoregressive and foundation models.
Reference graph
Works this paper leans on
-
[1]
ChatGPT: Get instant answers, find creative inspiration, learn something new
OpenAI, “ChatGPT: Get instant answers, find creative inspiration, learn something new.” https://openai.com/chatgpt, 2022
work page 2022
- [2]
-
[3]
Sora: Creating video from text
OpenAI, “Sora: Creating video from text.” https://openai.com/sora, 2024
work page 2024
-
[4]
Scalable diffusion models with transformers,
W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023
work page 2023
-
[5]
Texture synthesis by non-parametric sampling,
A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric sampling,” in Proceedings of the seventh IEEE international conference on computer vision, vol. 2, pp. 1033–1038, IEEE, 1999
work page 1999
-
[6]
P. S. Heckbert, “Survey of texture mapping,” IEEE Computer Graphics and Applications, vol. 6, no. 11, pp. 56–67, 1986
work page 1986
-
[7]
Generative adversarial networks,
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv, 2014
work page 2014
-
[8]
Auto-Encoding Variational Bayes
D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013
work page 2013
-
[9]
NICE: Non-linear Independent Components Estimation
L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014
work page 2014
-
[10]
Generative modeling by estimating gradients of the data distribution,
Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in Neural Information Processing Systems, vol. 32, 2019
work page 2019
-
[11]
Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023
-
[12]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017
work page 2017
-
[13]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018
work page 2018
-
[14]
Improving language understanding by generative pre-training,
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., “Improving language understanding by generative pre-training,” 2018
work page 2018
-
[15]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020
work page 2020
-
[16]
Swin transformer: Hierarchical vision transformer using shifted windows,
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021
work page 2021
-
[17]
U-net: Convolutional networks for biomedical image segmentation,
O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241, Springer, 2015
work page 2015
-
[18]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021
work page 2021
-
[19]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022
work page 2022
-
[20]
Midjourney: Text to image with ai art generator
Midjourney AI, “Midjourney: Text to image with ai art generator.” https://www.midjourneyai.ai/en, 2023
work page 2023
-
[21]
Improving image generation with better captions,
J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., “Improving image generation with better captions,” Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, p. 3, 2023
work page 2023
-
[22]
Pika is the idea-to-video platform that sets your creativity in motion
Pika AI, “Pika is the idea-to-video platform that sets your creativity in motion.” https://pika.art/home, 2023
work page 2023
-
[23]
Gen-2: The next step forward for generative ai
Runway AI, “Gen-2: The next step forward for generative ai.” https://research.runwayml.com/gen2, 2023
work page 2023
-
[24]
X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022
work page 2022
-
[25]
Scaling vision transformers to 22 billion parameters,
M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al., “Scaling vision transformers to 22 billion parameters,” in International Conference on Machine Learning, pp. 7480–7512, PMLR, 2023
work page 2023
-
[26]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, pp. 8748–8763, PMLR, 2021
work page 2021
-
[27]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023
work page 2023
-
[28]
Make-a-video: Text-to-video generation without text-video data,
U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman, “Make-a-video: Text-to-video generation without text-video data,” 2022
work page 2022
-
[29]
Imagen Video: High Definition Video Generation with Diffusion Models
J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022
work page 2022
-
[30]
R. Sutton, “The bitter lesson.” http://www.incompleteideas.net/IncIdeas/BitterLesson.html, March 2019. Accessed: Your Access Date Here
work page 2019
-
[31]
S. Xie, “Take on sora technical report.” https://twitter.com/sainingxie/status/1758433676105310543, 2024
work page 2024
-
[32]
Neural discrete representation learning,
A. Van Den Oord, O. Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017
work page 2017
-
[33]
Masked autoencoders are scalable vision learners,
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022
work page 2022
-
[34]
Preserve your own correlation: A noise prior for video diffusion models,
S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22930–22941, 2023
work page 2023
-
[35]
Adversarial diffusion distillation,
A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” arXiv preprint arXiv:2311.17042, 2023
-
[36]
Align your latents: High-resolution video synthesis with latent diffusion models,
A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023
work page 2023
-
[37]
Tokenlearner: Adaptive space-time tokenization for videos,
M. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova, “Tokenlearner: Adaptive space-time tokenization for videos,” Advances in Neural Information Processing Systems , vol. 34, pp. 12786–12797, 2021
work page 2021
-
[38]
Vivit: A video vision transformer,
A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” arXiv preprint arXiv:2103.15691, 2021
-
[39]
Flexivit: One model for all patch sizes,
L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic, “Flexivit: One model for all patch sizes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14496–14506, 2023
work page 2023
-
[40]
Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution,
M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, et al., “Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution,” Advances in Neural Information Processing Systems, vol. 36, 2024
work page 2024
-
[41]
M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgibbon, “Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance,” arXiv preprint arXiv:2107.02027, 2021
-
[42]
A-vit: Adaptive tokens for efficient vision transformer,
H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov, “A-vit: Adaptive tokens for efficient vision transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10809–10818, 2022
work page 2022
-
[43]
Token merging: Your vit but faster,
D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” in The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[44]
Adaptive token sampling for efficient vision transformers,
M. Fayyaz, S. A. Koohpayegani, F. R. Jafari, S. Sengupta, H. R. V. Joze, E. Sommerlade, H. Pirsiavash, and J. Gall, “Adaptive token sampling for efficient vision transformers,” in European Conference on Computer Vision, pp. 396–414, Springer, 2022
work page 2022
-
[45]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017
work page 2017
-
[46]
Is space-time attention all you need for video understanding?,
G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in ICML, vol. 2, p. 4, 2021
work page 2021
-
[47]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y . Cheng, A. Gupta, X. Gu, A. G. Hauptmann, et al., “Language model beats diffusion–tokenizer is key to visual generation,” arXiv preprint arXiv:2310.05737, 2023
work page 2023
-
[48]
Fast transformer decoding: One write-head is all you need,
N. Shazeer, “Fast transformer decoding: One write-head is all you need,” 2019
work page 2019
-
[49]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv:2305.13245, 2023
work page 2023
-
[50]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023
work page 2023
-
[51]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” arXiv preprint arXiv:1503.03585, 2015
work page 2015
-
[52]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020
work page 2020
-
[53]
Score-Based Generative Modeling through Stochastic Differential Equations
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020
work page 2020
-
[54]
All are worth words: A vit backbone for diffusion models,
F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, “All are worth words: A vit backbone for diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[55]
Masked diffusion transformer is a strong image synthesizer,
S. Gao, P. Zhou, M.-M. Cheng, and S. Yan, “Masked diffusion transformer is a strong image synthesizer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23164–23173, 2023
work page 2023
-
[56]
Masked diffusion transformer is a strong image synthesizer
S. Gao, P. Zhou, M.-M. Cheng, and S. Yan, “Masked diffusion transformer is a strong image synthesizer,” arXiv preprint arXiv:2303.14389, 2023
-
[57]
Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,
X. Xie, P. Zhou, H. Li, Z. Lin, and S. Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,” arXiv preprint arXiv:2208.06677, 2022
-
[58]
Efficient diffusion training via min-snr weighting strategy,
T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo, “Efficient diffusion training via min-snr weighting strategy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7441–7451, 2023
work page 2023
-
[59]
Classifier-Free Diffusion Guidance
J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022
work page 2022
-
[60]
Diffit: Diffusion vision transformers for image generation,
A. Hatamizadeh, J. Song, G. Liu, J. Kautz, and A. Vahdat, “Diffit: Diffusion vision transformers for image generation,” arXiv preprint arXiv:2312.02139, 2023
-
[61]
Progressive Distillation for Fast Sampling of Diffusion Models
T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022
work page 2022
-
[62]
Lavie: High-quality video generation with cascaded latent diffusion models,
Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, Y. Guo, T. Wu, C. Si, Y. Jiang, C. Chen, C. C. Loy, B. Dai, D. Lin, Y. Qiao, and Z. Liu, “Lavie: High-quality video generation with cascaded latent diffusion models,” 2023
work page 2023
-
[63]
Roformer: Enhanced transformer with rotary position embedding,
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024
work page 2024
-
[64]
Gentron: Delving deep into diffusion transformers for image and video generation,
S. Chen, M. Xu, J. Ren, Y. Cong, S. He, Y. Xie, A. Sinha, P. Luo, T. Xiang, and J.-M. Perez-Rua, “Gentron: Delving deep into diffusion transformers for image and video generation,” 2023
work page 2023
-
[65]
Photorealistic video generation with diffusion models,
A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, L. Fei-Fei, I. Essa, L. Jiang, and J. Lezama, “Photorealistic video generation with diffusion models,” arXiv preprint arXiv:2312.06662, 2023
-
[66]
Latte: Latent Diffusion Transformer for Video Generation
X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, “Latte: Latent diffusion transformer for video generation,” arXiv preprint arXiv:2401.03048, 2024
work page 2024
-
[67]
Snap video: Scaled spatiotemporal transformers for text-to-video synthesis,
W. Menapace, A. Siarohin, I. Skorokhodov, E. Deyneka, T.-S. Chen, A. Kag, Y. Fang, A. Stoliar, E. Ricci, J. Ren, et al., “Snap video: Scaled spatiotemporal transformers for text-to-video synthesis,” arXiv preprint arXiv:2402.14797, 2024
-
[68]
Elucidating the design space of diffusion-based generative models,
T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” Advances in Neural Information Processing Systems, vol. 35, pp. 26565–26577, 2022
work page 2022
-
[69]
Fit: Far-reaching interleaved transformers,
T. Chen and L. Li, “Fit: Far-reaching interleaved transformers,” arXiv preprint arXiv:2305.12689, 2023
-
[70]
Cascaded diffusion models for high fidelity image generation,
J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022
work page 2022
-
[71]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” 2021
work page 2021
-
[72]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023
work page 2023
-
[73]
Language models are few-shot learners,
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv, 2020
work page 2020
-
[74]
Conditional prompt learning for vision-language models,
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022
work page 2022
-
[75]
Multitask Prompted Training Enables Zero-Shot Task Generalization
V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., “Multitask prompted training enables zero-shot task generalization,” arXiv preprint arXiv:2110.08207, 2021
work page 2021
-
[76]
Finetuned Language Models Are Zero-Shot Learners
J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021
work page 2021
-
[77]
Training language models to follow instructions with human feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022
work page 2022
-
[78]
Scaling up visual and vision-language representation learning with noisy text supervision,
C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International Conference on Machine Learning, pp. 4904–4916, PMLR, 2021
work page 2021
-
[79]
Coca: Contrastive captioners are image-text foundation models
J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022
-
[80]
Video-text modeling with zero-shot transfer from contrastive captioners,
S. Yan, T. Zhu, Z. Wang, Y. Cao, M. Zhang, S. Ghosh, Y. Wu, and J. Yu, “Video-text modeling with zero-shot transfer from contrastive captioners,” arXiv preprint arXiv:2212.04979, 2022