Pith · machine review for the scientific record

arxiv: 2402.17177 · v3 · submitted 2024-02-27 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Chujie Gao, Hanchi Sun, Jianfeng Gao, Kai Zhang, Lichao Sun, Lifang He, Ruoxi Chen, Yixin Liu, Yuan Li, Yue Huang, Zhengqing Yuan, Zhiling Yan

Pith reviewed 2026-05-13 13:36 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords Sora · text-to-video generation · generative AI · world simulator · video synthesis · AI applications · limitations · future directions

The pith

Sora generates realistic videos from text by simulating physical world scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reviews Sora, a text-to-video generative model, by tracing its development and the technologies behind it using public reports and reverse engineering. It outlines applications in film-making, education, and marketing while examining challenges like safe and unbiased generation. A sympathetic reader would care because the model points toward new ways for AI to create visual content and interact with users in creative and productive tasks.

Core claim

The paper establishes that Sora operates as a world simulator trained to produce videos of realistic or imaginative scenes from text instructions. Its underlying large vision models are characterized only through details inferred from publicly available reports, and its projected impact across industries is tempered by the need for safer, less biased generation and further technical refinement in future video generation systems.

What carries the argument

The world simulator: a generative system that produces video sequences consistent with physical dynamics from textual input.
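
To make the world-simulator claim concrete, the sketch below is a purely illustrative Python toy of the pipeline that public reports and reverse-engineering accounts attribute to Sora-class models: embed a text prompt, initialize latent spacetime patches from noise, and run an iterative denoising loop (standing in for a diffusion transformer) before decoding to pixels. Every function, name, shape, and schedule here is a hypothetical stand-in for exposition, not OpenAI's implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def encode_text(prompt: str, dim: int = 64) -> np.ndarray:
        """Hypothetical text encoder: hash tokens into a fixed-size embedding."""
        vec = np.zeros(dim)
        for tok in prompt.lower().split():
            vec[hash(tok) % dim] += 1.0
        return vec / (np.linalg.norm(vec) + 1e-8)

    def denoise_step(patches: np.ndarray, text_emb: np.ndarray, t: float) -> np.ndarray:
        """Stand-in for one denoising step. A real diffusion transformer would
        predict noise with attention over spacetime patches; this toy simply
        nudges every patch toward a text-dependent target."""
        target = np.tile(text_emb, (patches.shape[0], 1))
        return patches + t * (target - patches)

    def generate_video(prompt: str, frames: int = 8, patches_per_frame: int = 4,
                       patch_dim: int = 64, steps: int = 50) -> np.ndarray:
        """Toy sampling loop over latent spacetime patches: noise to structure."""
        text_emb = encode_text(prompt, patch_dim)
        latents = rng.standard_normal((frames * patches_per_frame, patch_dim))
        for step in range(steps):
            t = (step + 1) / steps  # crude linear schedule, illustration only
            latents = denoise_step(latents, text_emb, t)
        # A real pipeline would now decode the latents back to pixels with a
        # video decoder (e.g., a VAE); here the latent grid is returned as-is.
        return latents.reshape(frames, patches_per_frame, patch_dim)

    video_latents = generate_video("a corgi surfing a wave at sunset")
    print(video_latents.shape)  # (8, 4, 64)

The reported ingredients this toy loosely mirrors are compressed spacetime latent patches and a diffusion-transformer backbone; see references [4], [19], and [40] in the reference graph below for the public works those inferences draw on.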

If this is right

  • Sora enables automated generation of video scenes for film production without traditional filming.
  • It supplies customizable visual explanations and simulations for educational content.
  • Marketing can produce tailored video assets from simple text descriptions at scale.
  • Widespread use depends on solving issues of bias, safety, and content control in outputs.
  • Further progress opens paths to more interactive and productive forms of human-AI video collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to interactive environments where users refine generated videos in real time.
  • It connects to broader efforts in building AI systems that respect physical laws for scientific visualization.
  • Generated content raises questions about ownership and verification that may require new standards.
  • Overcoming current length and consistency limits would test whether the simulator scales to complex multi-shot narratives.

Load-bearing premise

Public technical reports and reverse engineering efforts accurately reflect the model's true internal architecture and training details.

What would settle it

Official technical documentation released by the developers that directly contradicts the architecture and capabilities inferred from public sources.

Original abstract

Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and show potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a review of OpenAI's Sora text-to-video generative model released in February 2024. It traces the model's background and development, examines underlying technologies inferred from public technical reports and reverse-engineering efforts, surveys applications across film-making, education, and marketing, discusses challenges such as ensuring safe and unbiased generation, and outlines future directions for text-to-video models to support new forms of human-AI interaction.

Significance. If the synthesis of publicly available information holds, the paper offers a timely consolidation of knowledge on a leading large vision model in the text-to-video domain. This could serve as a useful reference for the computer vision community by identifying key limitations and opportunities, thereby guiding subsequent research on generative models without relying on proprietary claims.

minor comments (3)
  1. The abstract and introduction would benefit from explicit statements on the scope of reverse-engineering claims to avoid any implication of direct access to proprietary details (e.g., clarify in the first paragraph of the introduction that all technical descriptions derive solely from public sources).
  2. Section headings and transitions between technology descriptions and applications could be tightened for better flow; some paragraphs repeat background information already covered earlier.
  3. Add a short table or bullet list summarizing the main public reports cited for each technological component (e.g., diffusion models, transformer variants) to improve traceability and reader navigation.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of our manuscript as a timely review synthesizing public information on Sora. We appreciate the recommendation for minor revision and the recognition of its potential value to the computer vision community. Since the report does not list any specific major comments, we interpret the minor revision request as an opportunity to polish the presentation and ensure all claims remain grounded in publicly available sources.

Point-by-point responses
  1. Referee: No specific major comments were provided in the report; the review is generally supportive with a minor revision recommendation.

    Authors: We will perform a careful pass to address any minor issues such as clarity, citations, or formatting in the revised version. The core synthesis of background, technology inferences, applications, limitations, and opportunities will remain unchanged as it is based on public technical reports and reverse-engineering efforts. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is a literature review that explicitly grounds all claims in external public technical reports and reverse-engineering efforts rather than any internal derivations, fitted parameters, or self-referential predictions. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear; the central synthesis draws from outside sources and remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review paper, no new free parameters, axioms, or invented entities are introduced; the content depends entirely on external public reports and reverse engineering summaries.

pith-pipeline@v0.9.0 · 5508 in / 1019 out tokens · 59410 ms · 2026-05-13T13:36:43.816345+00:00 · methodology


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  2. MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation

    cs.RO 2026-05 unverdicted novelty 7.0

    MaMi-HOI counters geometric forgetting in diffusion models via a Geometry-Aware Proximity Adapter for precise contacts and a Kinematic Harmony Adapter for natural whole-body postures in human-object interactions.

  3. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...

  4. A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents the first large-scale benchmark for multi-frame geometric distortion removal in videos under severe refractive warping, using real and synthetic data across four distortion levels and evaluating classical and...

  5. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  6. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  7. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  8. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  9. Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...

  10. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  11. Controllable Generative Video Compression

    cs.CV 2026-04 unverdicted novelty 7.0

    CGVC uses coded keyframes and per-frame priors to guide controllable generative reconstruction of video frames, outperforming prior perceptual compression methods in both signal fidelity and perceptual quality.

  12. DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized metho...

  13. OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    cs.CV 2026-05 unverdicted novelty 6.0

    OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

  14. DiffATS: Diffusion in Aligned Tensor Space

    cs.LG 2026-05 unverdicted novelty 6.0

    DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...

  15. Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

    cs.CV 2026-04 unverdicted novelty 6.0

    Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.

  16. Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.

  17. DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...

  18. VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

  19. StoryBlender: Inter-Shot Consistent and Editable 3D Storyboard with Spatial-temporal Dynamics

    cs.CV 2026-04 unverdicted novelty 6.0

    StoryBlender generates inter-shot consistent editable 3D storyboards using a three-stage pipeline of semantic-spatial grounding, canonical asset materialization, and spatial-temporal dynamics with agent-based verification.

  20. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionGRPO applies GRPO with noise injection and hybrid rewards to diffusion-based egocentric motion recovery, overcoming vanishing gradients from low intra-group diversity to reach state-of-the-art performance.

  21. Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?

    cs.CV 2026-04 conditional novelty 5.0

    Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.

  22. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  23. The Amazing Stability of Flow Matching

    cs.CV 2026-04 unverdicted novelty 5.0

    Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.

  24. Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.

  25. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

  26. Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

    cs.CV 2026-04 unverdicted novelty 3.0

    The paper surveys the evolution of video trailer generation from extractive heuristics to generative AI methods and proposes a new taxonomy for future systems based on autoregressive and foundation models.

Reference graph

Works this paper leans on

201 extracted references · 201 canonical work pages · cited by 25 Pith papers · 24 internal anchors

  1. [1]

    Chatgpt: Get instant answers, find creative inspiration, learn something new

    OpenAI, “Chatgpt: Get instant answers, find creative inspiration, learn something new.” https://openai.com/chatgpt, 2022

  2. [2]

    Gpt-4 technical report,

    OpenAI, “Gpt-4 technical report,” 2023

  3. [3]

    Sora: Creating video from text

    OpenAI, “Sora: Creating video from text.” https://openai.com/sora, 2024

  4. [4]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023

  5. [5]

    Texture synthesis by non-parametric sampling,

    A. A. Efros and T. K. Leung, “Texture synthesis by non-parametric sampling,” in Proceedings of the seventh IEEE international conference on computer vision, vol. 2, pp. 1033–1038, IEEE, 1999

  6. [6]

    Survey of texture mapping,

    P. S. Heckbert, “Survey of texture mapping,” IEEE computer graphics and applications, vol. 6, no. 11, pp. 56–67, 1986

  7. [7]

    Generative adversarial networks,

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv, 2014

  8. [8]

    Auto-Encoding Variational Bayes

    D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013

  9. [9]

    NICE: Non-linear Independent Components Estimation

    L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014

  10. [10]

    Generative modeling by estimating gradients of the data distribution,

    Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in Neural Information Processing Systems, vol. 32, 2019

  11. [11]

    A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,

    Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023

  12. [12]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017

  13. [13]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

  14. [14]

    Improving language understanding by generative pre-training,

    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., “Improving language understanding by generative pre-training,” 2018

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  16. [16]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021

  17. [17]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241, Springer, 2015

  18. [18]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021

  19. [19]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022

  20. [20]

    Midjourney: Text to image with ai art generator

    M. AI, “Midjourney: Text to image with ai art generator.” https://www.midjourneyai.ai/en, 2023

  21. [21]

    Improving image generation with better captions,

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al., “Improving image generation with better captions,” Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, vol. 2, p. 3, 2023

  22. [22]

    Pika is the idea-to-video platform that sets your creativity in motion

    P. AI, “Pika is the idea-to-video platform that sets your creativity in motion.” https://pika.art/home, 2023

  23. [23]

    Gen-2: The next step forward for generative ai

    R. AI, “Gen-2: The next step forward for generative ai.” https://research.runwayml.com/gen2, 2023

  24. [24]

    Scaling vision transformers,

    X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022

  25. [25]

    Scaling vision transformers to 22 billion parameters,

    M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al., “Scaling vision transformers to 22 billion parameters,” in International Conference on Machine Learning, pp. 7480–7512, PMLR, 2023

  26. [26]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021

  27. [27]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al., “Stable video diffusion: Scaling latent video diffusion models to large datasets,” arXiv preprint arXiv:2311.15127, 2023

  28. [28]

    Make-a-video: Text-to-video generation without text-video data,

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman, “Make-a-video: Text-to-video generation without text-video data,” 2022

  29. [29]

    Imagen Video: High Definition Video Generation with Diffusion Models

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al., “Imagen video: High definition video generation with diffusion models,” arXiv preprint arXiv:2210.02303, 2022

  30. [30]

    The bitter lesson

    R. Sutton, “The bitter lesson.” http://www.incompleteideas.net/IncIdeas/BitterLesson.html, March 2019

  31. [31]

    Take on sora technical report

    S. Xie, “Take on sora technical report.” https://twitter.com/sainingxie/status/1758433676105310543, 2024

  32. [32]

    Neural discrete representation learning,

    A. Van Den Oord, O. Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017

  33. [33]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022

  34. [34]

    Preserve your own correlation: A noise prior for video diffusion models,

    S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22930–22941, 2023

  35. [35]

    Adversarial diffusion distillation

    A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” arXiv preprint arXiv:2311.17042, 2023

  36. [36]

    Align your latents: High-resolution video synthesis with latent diffusion models,

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023

  37. [37]

    Tokenlearner: Adaptive space-time tokenization for videos,

    M. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova, “Tokenlearner: Adaptive space-time tokenization for videos,” Advances in Neural Information Processing Systems , vol. 34, pp. 12786–12797, 2021

  38. [38]

    Vivit: A video vision transformer,

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” arXiv preprint arXiv:2103.15691, 2021

  39. [39]

    Flexivit: One model for all patch sizes,

    L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic, “Flexivit: One model for all patch sizes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14496–14506, 2023

  40. [40]

    Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution,

    M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, et al., “Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution,” Advances in Neural Information Processing Systems, vol. 36, 2024

  41. [41]

    Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance,

    M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgibbon, “Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance,” arXiv preprint arXiv:2107.02027, 2021

  42. [42]

    A-vit: Adaptive tokens for efficient vision transformer,

    H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov, “A-vit: Adaptive tokens for efficient vision transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10809–10818, 2022

  43. [43]

    Token merging: Your vit but faster,

    D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” in The Eleventh International Conference on Learning Representations, 2022

  44. [44]

    Adaptive token sampling for efficient vision transformers,

    M. Fayyaz, S. A. Koohpayegani, F. R. Jafari, S. Sengupta, H. R. V. Joze, E. Sommerlade, H. Pirsiavash, and J. Gall, “Adaptive token sampling for efficient vision transformers,” in European Conference on Computer Vision, pp. 396–414, Springer, 2022

  45. [45]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

  46. [46]

    Is space-time attention all you need for video understanding?,

    G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in ICML, vol. 2, p. 4, 2021

  47. [47]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta, X. Gu, A. G. Hauptmann, et al., “Language model beats diffusion–tokenizer is key to visual generation,” arXiv preprint arXiv:2310.05737, 2023

  48. [48]

    Fast transformer decoding: One write-head is all you need,

    N. Shazeer, “Fast transformer decoding: One write-head is all you need,” 2019

  49. [49]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv:2305.13245, 2023

  50. [50]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023

  51. [51]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” arXiv preprint arXiv:1503.03585, 2015

  52. [52]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020

  53. [53]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020

  54. [54]

    All are worth words: A vit backbone for diffusion models,

    F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu, “All are worth words: A vit backbone for diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  55. [55]

    Masked diffusion transformer is a strong image synthesizer,

    S. Gao, P. Zhou, M.-M. Cheng, and S. Yan, “Masked diffusion transformer is a strong image synthesizer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23164–23173, 2023

  56. [56]

    Masked diffusion transformer is a strong image synthesizer

    S. Gao, P. Zhou, M.-M. Cheng, and S. Yan, “Masked diffusion transformer is a strong image synthesizer,” arXiv preprint arXiv:2303.14389, 2023

  57. [57]

    Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,

    X. Xie, P. Zhou, H. Li, Z. Lin, and S. Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,” arXiv preprint arXiv:2208.06677, 2022

  58. [58]

    Efficient diffusion training via min-snr weighting strategy,

    T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo, “Efficient diffusion training via min-snr weighting strategy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7441–7451, 2023

  59. [59]

    Classifier-Free Diffusion Guidance

    J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022

  60. [60]

    Diffit: Diffusion vision transformers for image generation,

    A. Hatamizadeh, J. Song, G. Liu, J. Kautz, and A. Vahdat, “Diffit: Diffusion vision transformers for image generation,” arXiv preprint arXiv:2312.02139, 2023

  61. [61]

    Progressive Distillation for Fast Sampling of Diffusion Models

    T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022

  62. [62]

    Lavie: High-quality video generation with cascaded latent diffusion models,

    Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, Y. Guo, T. Wu, C. Si, Y. Jiang, C. Chen, C. C. Loy, B. Dai, D. Lin, Y. Qiao, and Z. Liu, “Lavie: High-quality video generation with cascaded latent diffusion models,” 2023

  63. [63]

    Roformer: Enhanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024

  64. [64]

    Gentron: Delving deep into diffusion transformers for image and video generation,

    S. Chen, M. Xu, J. Ren, Y. Cong, S. He, Y. Xie, A. Sinha, P. Luo, T. Xiang, and J.-M. Perez-Rua, “Gentron: Delving deep into diffusion transformers for image and video generation,” 2023

  65. [65]

    Photorealistic video generation with diffusion models,

    A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, L. Fei-Fei, I. Essa, L. Jiang, and J. Lezama, “Photorealistic video generation with diffusion models,” arXiv preprint arXiv:2312.06662, 2023

  66. [66]

    Latte: Latent Diffusion Transformer for Video Generation

    X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, “Latte: Latent diffusion transformer for video generation,” arXiv preprint arXiv:2401.03048, 2024

  67. [67]

    Snap video: Scaled spatiotemporal transformers for text-to-video synthesis,

    W. Menapace, A. Siarohin, I. Skorokhodov, E. Deyneka, T.-S. Chen, A. Kag, Y. Fang, A. Stoliar, E. Ricci, J. Ren, et al., “Snap video: Scaled spatiotemporal transformers for text-to-video synthesis,” arXiv preprint arXiv:2402.14797, 2024

  68. [68]

    Elucidating the design space of diffusion-based generative models,

    T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” Advances in Neural Information Processing Systems, vol. 35, pp. 26565–26577, 2022

  69. [69]

    Fit: Far-reaching interleaved transformers,

    T. Chen and L. Li, “Fit: Far-reaching interleaved transformers,” arXiv preprint arXiv:2305.12689, 2023

  70. [70]

    Cascaded diffusion models for high fidelity image generation,

    J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022

  71. [71]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” 2021

  72. [72]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023

  73. [73]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” arXiv, 2020

  74. [74]

    Conditional prompt learning for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022

  75. [75]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., “Multitask prompted training enables zero-shot task generalization,” arXiv preprint arXiv:2110.08207, 2021

  76. [76]

    Finetuned Language Models Are Zero-Shot Learners

    J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021

  77. [77]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022

  78. [78]

    Scaling up visual and vision-language representation learning with noisy text supervision,

    C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning, pp. 4904–4916, PMLR, 2021

  79. [79]

    Coca: Contrastive captioners are image-text foundation models

    J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022

  80. [80]

    Video-text modeling with zero-shot transfer from contrastive captioners,

    S. Yan, T. Zhu, Z. Wang, Y. Cao, M. Zhang, S. Ghosh, Y. Wu, and J. Yu, “Video-text modeling with zero-shot transfer from contrastive captioners,” arXiv preprint arXiv:2212.04979, 2022

Showing first 80 references.