pith. machine review for the scientific record.

arxiv: 2412.14169 · v2 · pith:7ZPEYLCNnew · submitted 2024-12-18 · 💻 cs.CV

Autoregressive Video Generation without Vector Quantization

Pith reviewed 2026-05-17 15:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords autoregressive video generation · non-quantized modeling · frame-by-frame prediction · text-to-video · vector quantization · GPT-style autoregression · video synthesis efficiency

The pith

Video can be generated autoregressively without vector quantization by predicting frames one after another in time and, within each frame, sets of tokens one after another in space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a reformulation of video generation as non-quantized autoregressive modeling that combines temporal frame-by-frame prediction with spatial set-by-set prediction inside frames. This keeps the causal property of GPT-style models for flexible conditioning while using bidirectional modeling within frames to improve efficiency. A sympathetic reader would care because the approach removes the discretization step of vector quantization yet still produces coherent videos with higher fidelity and fluency. The resulting NOVA model, at 0.6 billion parameters, reportedly outperforms earlier autoregressive video models in speed and data efficiency and also beats state-of-the-art image diffusion models on text-to-image tasks with lower training cost.

Core claim

By modeling video generation as a non-quantized autoregressive process that performs temporal frame-by-frame prediction and spatial set-by-set prediction, it is possible to maintain causal autoregressive structure while achieving high visual fidelity, fluency, and efficiency without any vector quantization step.

What carries the argument

Non-quantized autoregressive modeling via temporal frame-by-frame prediction and spatial set-by-set prediction, which preserves causality across frames while enabling bidirectional processing inside each frame.
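The attention pattern implied by this description can be pictured as a block-causal mask: a token may attend to every token in its own frame and in earlier frames, but to nothing in later frames. The sketch below is illustrative only. It assumes a flattened token sequence with a fixed number of tokens per frame, and the name `block_causal_mask` and its parameters are hypothetical, not taken from the NOVA code, which may realize the same property with separate temporal and spatial layers.

```python
def block_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumption for illustration: tokens are flattened frame by frame, so the
    frame index of token i is i // tokens_per_frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # Same frame: allowed in both directions (bidirectional within a frame).
        # Earlier frame: allowed. Later frame: blocked (causal across frames).
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


if __name__ == "__main__":
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With three frames of two tokens each, the printed rows are `xx....`, `xx....`, `xxxx..`, `xxxx..`, `xxxxxx`, `xxxxxx`: square blocks on the diagonal (bidirectional within a frame) under a lower-triangular block structure (causal across frames).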

If this is right

  • NOVA achieves better data efficiency and faster inference than prior autoregressive video models despite using far fewer parameters.
  • The same unified model supports generalization to longer videos and diverse zero-shot tasks.
  • It outperforms leading image diffusion models on text-to-image generation at lower training cost.
  • The approach removes the need for a separate quantization stage while retaining GPT-style causal flexibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Eliminating vector quantization could reduce reconstruction artifacts that often appear in VQ-based video models.
  • The frame-plus-set prediction pattern might transfer to other continuous-sequence domains such as audio or motion synthesis.
  • Because the model stays causal across time, it could support longer-context video editing or interpolation without retraining.

Load-bearing premise

That continuous visual features can be predicted autoregressively frame by frame and set by set without losing the information needed for coherent video output.
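To make "set by set" concrete, here is a minimal sketch of how one frame could be filled in a few steps instead of one token at a time. Everything in it is an assumption made for illustration: the random, equal-sized partition, the scalar stand-in for a continuous token, and the `predict_set` callback that plays the role of the model. The paper's actual schedule and predictor are not described on this page.

```python
import random


def set_by_set_schedule(num_tokens: int, num_sets: int, seed: int = 0) -> list[list[int]]:
    """Partition positions 0..num_tokens-1 into ordered, disjoint sets."""
    if not 1 <= num_sets <= num_tokens:
        raise ValueError("num_sets must be between 1 and num_tokens")
    order = list(range(num_tokens))
    random.Random(seed).shuffle(order)  # hypothetical choice: random order
    base, extra = divmod(num_tokens, num_sets)
    sets, start = [], 0
    for i in range(num_sets):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        sets.append(sorted(order[start:start + size]))
        start += size
    return sets


def generate_frame(num_tokens, num_sets, predict_set, seed=0):
    """Fill one frame of continuous values, one set of positions per step.

    predict_set(known, positions) is a hypothetical stand-in for the model.
    It receives the values generated so far (dict: position -> float) and
    returns one float per requested position.
    """
    known: dict[int, float] = {}
    for positions in set_by_set_schedule(num_tokens, num_sets, seed):
        values = predict_set(dict(known), positions)
        known.update(zip(positions, values))
    return [known[i] for i in range(num_tokens)]


if __name__ == "__main__":
    # Dummy predictor: label each position with the number of tokens already known.
    frame = generate_frame(16, 4, lambda known, pos: [float(len(known))] * len(pos))
    print(frame)
```

Setting `num_sets` equal to `num_tokens` recovers token-by-token prediction, and `num_sets=1` predicts the whole frame in a single step, so the number of sets trades sequential steps against how much context each prediction sees.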

What would settle it

Training the same model on standard video benchmarks and finding that the generated videos show clear drops in temporal coherence or visual detail compared with quantized autoregressive baselines would falsify the claim.

Original abstract

This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost. Additionally, NOVA generalizes well across extended video durations and enables diverse zero-shot applications in one unified model. Code and models are publicly available at https://github.com/baaivision/NOVA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NOVA, a non-quantized autoregressive model for video generation that reformulates the problem as temporal frame-by-frame prediction combined with spatial set-by-set prediction. This maintains the causal property of GPT-style models while enabling bidirectional modeling within frames. The central claims are that a 0.6B-parameter NOVA model surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, outperforms state-of-the-art image diffusion models on text-to-image tasks with lower training cost, generalizes to longer video durations, and supports diverse zero-shot applications in a single model.

Significance. If the empirical results hold under rigorous controls, the work would be significant for demonstrating that vector quantization can be eliminated from autoregressive video models without sacrificing (and potentially improving) quality and efficiency. This challenges the prevailing reliance on discrete bottlenecks in prior AR video work and could influence future designs of continuous generative models. The public release of code and models is a clear strength for reproducibility.

major comments (2)
  1. [Abstract and §4] Abstract and experimental sections: performance gains over prior AR video models and diffusion models are asserted without details on experimental controls, exact baselines, metrics (e.g., FVD, FID, CLIP score), training data volume, or statistical significance testing. This prevents assessment of the central claims of superior data efficiency and fidelity with a smaller 0.6B model.
  2. [§3] Method section on set-by-set prediction: the claim that continuous spatial set-by-set autoregressive prediction preserves sufficient intra-frame joint distributions and high-frequency details without discretization is load-bearing for the no-VQ advantage, yet no ablation, density modeling analysis, or comparison to Gaussian NLL/MSE baselines is provided to address why this succeeds where earlier non-quantized attempts failed.
minor comments (2)
  1. [Abstract] The abstract would benefit from explicitly naming the quantitative metrics used for 'visual fidelity' and 'video fluency'.
  2. [Figures] Figure captions and axis labels in qualitative results could be clarified to indicate the exact conditioning (text prompt, previous frames) for each example.
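For readers unfamiliar with the metrics named in the first major comment: FID and FVD both report a Fréchet distance between Gaussians fitted to features of real and generated samples (image features for FID, video features for FVD). The general form is ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). The sketch below covers only the simplified diagonal-covariance case, where the trace term reduces to a sum of squared differences of standard deviations; that restriction and the function name `frechet_distance_diag` are assumptions for illustration, and real evaluations use full covariance matrices.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between two Gaussians with diagonal covariance.

    mu1, mu2: per-dimension means. var1, var2: per-dimension variances.
    Simplifying assumption: no cross-dimension covariance.
    """
    if not (len(mu1) == len(var1) == len(mu2) == len(var2)):
        raise ValueError("all inputs must have the same length")
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # For diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^(1/2)) = sum (s1 - s2)^2.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


if __name__ == "__main__":
    print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

Lower is better for both metrics, and identical feature statistics give a distance of zero.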

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of experimental details and methodological analysis. We address each point below and have revised the manuscript to incorporate additional information and studies.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and experimental sections: performance gains over prior AR video models and diffusion models are asserted without details on experimental controls, exact baselines, metrics (e.g., FVD, FID, CLIP score), training data volume, or statistical significance testing. This prevents assessment of the central claims of superior data efficiency and fidelity with a smaller 0.6B model.

    Authors: We agree that expanded experimental documentation is warranted. In the revised version, §4 now includes a dedicated subsection detailing all baselines with citations, the full set of evaluation metrics (FVD, FID, CLIP score, and others), training dataset sizes and compositions, model capacity comparisons, and hardware/training protocols. We have also added multi-seed results for key comparisons to support reproducibility, although formal statistical significance tests were not performed owing to the substantial compute required for video generation; we note this limitation explicitly.

    revision: yes

  2. Referee: [§3] Method section on set-by-set prediction: the claim that continuous spatial set-by-set autoregressive prediction preserves sufficient intra-frame joint distributions and high-frequency details without discretization is load-bearing for the no-VQ advantage, yet no ablation, density modeling analysis, or comparison to Gaussian NLL/MSE baselines is provided to address why this succeeds where earlier non-quantized attempts failed.

    Authors: We acknowledge the value of explicit supporting analysis. The revised §3 now contains an ablation subsection that directly compares set-by-set continuous prediction against raster-scan ordering and Gaussian NLL/MSE alternatives. We report both quantitative metrics on high-frequency detail retention and qualitative visualizations of intra-frame distributions, together with a brief discussion of why the bidirectional set modeling succeeds where prior fully continuous attempts encountered difficulties.

    revision: yes
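As background on the Gaussian NLL/MSE baselines named in this exchange: with a fixed unit variance, the Gaussian negative log-likelihood equals half the squared error plus a constant, so both objectives pull a point prediction toward the conditional mean. The toy sketch below illustrates this on a two-mode target. The helper names and the toy data are hypothetical, and this is a general property of these losses, not an experiment from the paper.

```python
import math


def gaussian_nll(x: float, mean: float, var: float = 1.0) -> float:
    """Negative log-likelihood of x under a 1-D Gaussian N(mean, var)."""
    return 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def mse(targets, prediction):
    """Mean squared error of one constant prediction against all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)


def mean_nll(targets, prediction):
    """Average unit-variance Gaussian NLL of one constant prediction."""
    return sum(gaussian_nll(t, prediction) for t in targets) / len(targets)


def best_constant(targets, candidates, loss):
    """Pick the candidate prediction with the lowest loss."""
    return min(candidates, key=lambda c: loss(targets, c))


if __name__ == "__main__":
    targets = [-1.0, 1.0] * 50  # toy bimodal target: half at -1, half at +1
    candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
    print(best_constant(targets, candidates, mse))
    print(best_constant(targets, candidates, mean_nll))
```

Both losses select 0.0, a value that never occurs in the data. This averaging of modes is the usual explanation for blurry outputs from plain regression, which is why the comparison the referee requests is informative.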

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation of a new modeling approach

Full rationale

The paper reformulates video generation as non-quantized autoregressive modeling via temporal frame-by-frame prediction combined with spatial set-by-set prediction, preserving GPT-style causality while adding intra-frame bidirectionality. Central performance claims (superior data efficiency, speed, fidelity, and fluency for a 0.6B model, plus text-to-image gains) are presented as outcomes of training and benchmarking the resulting NOVA model against prior VQ-based autoregressive and diffusion baselines. No equations, uniqueness theorems, or first-principles derivations appear that reduce by construction to fitted inputs, self-citations, or ansatzes imported from the authors' prior work; the approach is validated externally through reported metrics rather than tautological redefinitions. This is the most common honest finding for an empirical modeling paper whose load-bearing step is the experimental demonstration that the proposed partitioning compensates for the absence of discretization.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that continuous-valued autoregressive prediction can replace quantized token modeling without loss of fidelity, plus standard neural network training assumptions. No new physical entities or free parameters are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Non-quantized autoregressive modeling of video frames can achieve high visual fidelity and fluency.
    Invoked when reformulating the generation problem without vector quantization.

pith-pipeline@v0.9.0 · 5519 in / 1194 out tokens · 59875 ms · 2026-05-17T15:02:54.272971+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  2. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...

  3. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  4. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  5. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...

  6. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.

  7. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  8. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  9. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  10. MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

  11. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  12. Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

    cs.CV 2025-12 conditional novelty 6.0

    Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.

  13. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    cs.CV 2025-10 conditional novelty 6.0

    Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.

  14. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  15. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  16. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

  17. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 16 Pith papers · 21 internal anchors

  1. [1]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,

  2. [2]

    arXiv preprint arXiv:2408.07009,

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023a. James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Z...

  4. [4]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    ChameleonTeam. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818,

  5. [5]

    Chang, H

    Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704,

  6. [6]

    Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pp. 74–91. Springer, 2024b. Published as a conference paper at ICLR 2025. Tsai-Shien Chen, Al...

  7. [7]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    P Goyal. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677,

  8. [8]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725,

  9. [9]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  10. [10]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

  11. [11]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135,

  12. [12]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pp. 646–661. Springer,

  13. [13]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125,

  14. [14]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024a. Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck...

  15. [15]

    Playground v3: Improving text-to-image alignment with deep-fusion large language models

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695,

  16. [16]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101,

  17. [17]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073,

  18. [18]

    Transframer: Arbitrary frame prediction with generative models

    Charlie Nash, Joao Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494,

  19. [19]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741,

  20. [20]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  21. [21]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,

  22. [22]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,

  23. [23]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 2024a. Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beat...

  24. [24]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905,

  25. [25]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv: 2307.09288,

  26. [26]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024a. Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan ...

  27. [27]

    Loong: Generating minute-level long videos with autoregressive language models

    Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024b. Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv prepr...

  28. [28]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072,

  29. [29]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5,

  30. [30]

    Show-1: Marrying pixel and latent diffusion models for text-to-video generation

    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a. Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceeding...

  31. [31]

    Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al

    URL https://github.com/hpcaitech/Open-Sora. Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. arXiv preprint arXiv:2406.18583,

  32. [32]

    Here, more implementation details and ablation experiments are organized as follows: • Architecture details of Scaling and Shift layer (Sec

    APPENDIX We strictly publish our code and pretrained models to improve interpretability and assure reproducibility. Here, more implementation details and ablation experiments are organized as follows: • Architecture details of Scaling and Shift layer (Sec. A) • Normalization configurations (Sec. B) • Video...

  33. [33]

    Figure 11: Scaling and Shift layer

    Specifically, we refer to AdaLayerNorm and decompose the motion changes into mean and variance parameters, which are further used to apply the affine transformation on BOV embeddings. [Figure 11: Scaling and Shift layer. Diagram labels: UpProjector, DownProjector, LayerNorm, Scale, Shift, <BOV> outputs, temporal outputs, indicator tokens.] We reformulate cross-frame motion changes by l...

  34. [34]

    While NOVA is already efficient in text-to-video generation, there is potential for further acceleration in the spatial layers

    In each video, the temporal layers require only 0.03 seconds, compared to 11.97 seconds for the spatial layers, highlighting the exceptional efficiency of the temporal layers. While NOVA is already efficient in text-to-video generation, there is potential for further acceleration in the spatial layers. Table 4: Inference time analysis for different layer...

  35. [35]

    This limitation may be attributed to our reliance on extensive web datasets, such as LAION and DataComp

    While NOVA outperforms most models of comparable size and matches the overall score of state-of-the-art models, we observe that increasing the model scale results in marginal improvements and does not boost the text rendering performance. This limitation may be attributed to our reliance on extensive web datasets, such as LAION and DataComp. In future wo...

  36. [36]

    In the foreground is the detailed, head-and-shoulders portrait of an elderly man with a long white beard

    NOVA can generate images with a maximum resolution of 1024×1024. Our model excels in the domain of text-to-image generation, producing a vast array of high-quality images that accurately reflect the textual descriptions provided. This capability not only spans a wide range of subjects, from realistic landscapes and portraits to imaginative and abstract c...