Autoregressive Video Generation without Vector Quantization
Pith reviewed 2026-05-17 15:02 UTC · model grok-4.3
The pith
Video generation can be done autoregressively without vector quantization by predicting frames one at a time and, within each frame, one spatial set at a time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modeling video generation as a non-quantized autoregressive process, with temporal frame-by-frame prediction and spatial set-by-set prediction, preserves the causal autoregressive structure while achieving high visual fidelity, fluency, and efficiency without any vector quantization step.
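One way to write this factorization down, as a sketch in notation that is assumed here rather than taken from the paper: let $c$ be the text condition, $x_t$ the continuous features of frame $t$, and $S_{t,1},\dots,S_{t,K}$ a partition of frame $t$ into $K$ disjoint spatial sets.

```latex
p(x_1,\dots,x_T \mid c) = \prod_{t=1}^{T} p(x_t \mid c, x_{<t}),
\qquad
p(x_t \mid c, x_{<t}) = \prod_{k=1}^{K} p\bigl(S_{t,k} \mid c, x_{<t}, S_{t,<k}\bigr).
```

The outer product is the causal, frame-by-frame part. The inner product is the set-by-set part, where all elements of one set are predicted together. Choosing single-element sets visited in a fixed scan order would recover raster-scan prediction as a special case.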
What carries the argument
Non-quantized autoregressive modeling via temporal frame-by-frame prediction and spatial set-by-set prediction, which preserves causality across frames while enabling bidirectional processing inside each frame.
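A minimal sketch of the attention pattern this describes, assuming the tokens of all frames are flattened frame by frame into one sequence. The function name `block_causal_mask` and the toy sizes are illustrative assumptions, not taken from the NOVA code base.

```python
import numpy as np


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return a boolean mask; entry [i, j] is True if token i may attend to token j."""
    # Frame index of every token in the flattened, frame-major sequence.
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A token may attend to every token in its own frame (bidirectional inside
    # a frame) and to every token in earlier frames (causal across frames).
    return frame_id[:, None] >= frame_id[None, :]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

With three frames of two tokens each, the mask is block lower-triangular: each 2×2 diagonal block is fully visible, and every block above the diagonal is hidden.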
If this is right
- NOVA achieves better data efficiency and faster inference than prior autoregressive video models despite using far fewer parameters.
- The same unified model supports generalization to longer videos and diverse zero-shot tasks.
- It outperforms leading image diffusion models on text-to-image generation at lower training cost.
- The approach removes the need for a separate quantization stage while retaining GPT-style causal flexibility.
Where Pith is reading between the lines
- Eliminating vector quantization could reduce reconstruction artifacts that often appear in VQ-based video models.
- The frame-plus-set prediction pattern might transfer to other continuous-sequence domains such as audio or motion synthesis.
- Because the model stays causal across time, it could support longer-context video editing or interpolation without retraining.
Load-bearing premise
That continuous visual features can be predicted autoregressively frame by frame and set by set without losing the information needed for coherent video output.
What would settle it
Training the same model on standard video benchmarks and finding that the generated videos show clear drops in temporal coherence or visual detail compared with quantized autoregressive baselines would falsify the claim.
Original abstract
This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost. Additionally, NOVA generalizes well across extended video durations and enables diverse zero-shot applications in one unified model. Code and models are publicly available at https://github.com/baaivision/NOVA.
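The contrast between raster-scan and set-by-set prediction can be sketched as a difference in generation schedules. The example below is a toy illustration under stated assumptions: a frame is treated as a grid of positions, and the sets are drawn as a random partition, which is an assumption made here because the abstract does not say how sets are chosen. The helper names are hypothetical.

```python
import numpy as np


def random_set_schedule(height: int, width: int, num_sets: int, seed: int = 0):
    """Split the positions of one frame into `num_sets` disjoint sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(height * width)
    # Each chunk is one prediction step; positions in a chunk are predicted together.
    return [np.sort(chunk) for chunk in np.array_split(order, num_sets)]


def raster_schedule(height: int, width: int):
    """Raster scan: one position per step, left to right and top to bottom."""
    return [np.array([i]) for i in range(height * width)]


sets = random_set_schedule(4, 4, num_sets=4)
print(len(raster_schedule(4, 4)), "raster steps vs", len(sets), "set steps")
```

For a 4×4 grid this gives 16 sequential steps under raster scan and 4 under the set schedule, which is the sense in which predicting a whole set per step can reduce the number of sequential passes inside a frame.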
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NOVA, a non-quantized autoregressive model for video generation that reformulates the problem as temporal frame-by-frame prediction combined with spatial set-by-set prediction. This maintains the causal property of GPT-style models while enabling bidirectional modeling within frames. The central claims are that a 0.6B-parameter NOVA model surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, outperforms state-of-the-art image diffusion models on text-to-image tasks with lower training cost, generalizes to longer video durations, and supports diverse zero-shot applications in a single model.
Significance. If the empirical results hold under rigorous controls, the work would be significant for demonstrating that vector quantization can be eliminated from autoregressive video models without sacrificing (and potentially improving) quality and efficiency. This challenges the prevailing reliance on discrete bottlenecks in prior AR video work and could influence future designs of continuous generative models. The public release of code and models is a clear strength for reproducibility.
major comments (2)
- [Abstract and §4] Abstract and experimental sections: performance gains over prior AR video models and diffusion models are asserted without details on experimental controls, exact baselines, metrics (e.g., FVD, FID, CLIP score), training data volume, or statistical significance testing. This prevents assessment of the central claims of superior data efficiency and fidelity with a smaller 0.6B model.
- [§3] Method section on set-by-set prediction: the claim that continuous spatial set-by-set autoregressive prediction preserves sufficient intra-frame joint distributions and high-frequency details without discretization is load-bearing for the no-VQ advantage, yet no ablation, density modeling analysis, or comparison to Gaussian NLL/MSE baselines is provided to address why this succeeds where earlier non-quantized attempts failed.
minor comments (2)
- [Abstract] The abstract would benefit from explicitly naming the quantitative metrics used for 'visual fidelity' and 'video fluency'.
- [Figures] Figure captions and axis labels in qualitative results could be clarified to indicate the exact conditioning (text prompt, previous frames) for each example.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of experimental details and methodological analysis. We address each point below and have revised the manuscript to incorporate additional information and studies.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and experimental sections: performance gains over prior AR video models and diffusion models are asserted without details on experimental controls, exact baselines, metrics (e.g., FVD, FID, CLIP score), training data volume, or statistical significance testing. This prevents assessment of the central claims of superior data efficiency and fidelity with a smaller 0.6B model.
Authors: We agree that expanded experimental documentation is warranted. In the revised version, §4 now includes a dedicated subsection detailing all baselines with citations, the full set of evaluation metrics (FVD, FID, CLIP score, and others), training dataset sizes and compositions, model capacity comparisons, and hardware/training protocols. We have also added multi-seed results for key comparisons to support reproducibility, although formal statistical significance tests were not performed owing to the substantial compute required for video generation; we note this limitation explicitly.
revision: yes
-
Referee: [§3] Method section on set-by-set prediction: the claim that continuous spatial set-by-set autoregressive prediction preserves sufficient intra-frame joint distributions and high-frequency details without discretization is load-bearing for the no-VQ advantage, yet no ablation, density modeling analysis, or comparison to Gaussian NLL/MSE baselines is provided to address why this succeeds where earlier non-quantized attempts failed.
Authors: We acknowledge the value of explicit supporting analysis. The revised §3 now contains an ablation subsection that directly compares set-by-set continuous prediction against raster-scan ordering and Gaussian NLL/MSE alternatives. We report both quantitative metrics on high-frequency detail retention and qualitative visualizations of intra-frame distributions, together with a brief discussion of why the bidirectional set modeling succeeds where prior fully continuous attempts encountered difficulties.
revision: yes
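Why the referee singles out Gaussian NLL/MSE baselines can be illustrated with a toy case. A Gaussian likelihood with fixed variance reduces to mean squared error, and an MSE-optimal point prediction averages over the modes of a multimodal target. The sketch below uses a made-up one-dimensional target and says nothing about the paper's actual ablation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy one-dimensional "token": its true value is -1 or +1 with equal probability.
targets = rng.choice([-1.0, 1.0], size=10_000)

# The constant prediction that minimizes mean squared error is the sample mean.
mse_optimal = targets.mean()


def mse(prediction: float) -> float:
    """Mean squared error of a constant prediction against the toy targets."""
    return float(np.mean((targets - prediction) ** 2))


print(f"MSE-optimal prediction: {mse_optimal:+.3f}")
print(f"MSE at the mean: {mse(mse_optimal):.3f}, MSE at a true mode (+1): {mse(1.0):.3f}")
```

The MSE-optimal prediction lands near zero, a value the data never takes, yet it scores a lower error than either real mode. In image terms this kind of averaging shows up as blur, which is why a continuous autoregressive model needs a richer per-token output distribution than a single regressed value.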
Circularity Check
No significant circularity; claims rest on empirical evaluation of a new modeling approach
Full rationale
The paper reformulates video generation as non-quantized autoregressive modeling via temporal frame-by-frame prediction combined with spatial set-by-set prediction, preserving GPT-style causality while adding intra-frame bidirectionality. Central performance claims (superior data efficiency, speed, fidelity, and fluency for a 0.6B model, plus text-to-image gains) are presented as outcomes of training and benchmarking the resulting NOVA model against prior VQ-based autoregressive and diffusion baselines. No equations, uniqueness theorems, or first-principles derivations appear that reduce by construction to fitted inputs, self-citations, or ansatzes imported from the authors' prior work; the approach is validated externally through reported metrics rather than tautological redefinitions. This is the most common honest finding for an empirical modeling paper whose load-bearing step is the experimental demonstration that the proposed partitioning compensates for the absence of discretization.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Non-quantized autoregressive modeling of video frames can achieve high visual fidelity and fluency.
Forward citations
Cited by 17 Pith papers
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
-
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
-
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
-
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
-
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[1]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,
- [2]
-
[3]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023a. James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Z...
-
[4]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818,
- [5]
-
[6]
Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pp. 74–91. Springer, 2024b. Published as a conference paper at ICLR 2025. Tsai-Shien Chen, Al...
-
[7]
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
P Goyal. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677,
-
[8]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725,
-
[9]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
-
[10]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,
-
[11]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135,
-
[12]
Deep networks with stochastic depth
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pp. 646–661. Springer,
-
[13]
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125,
-
[14]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024a. Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck...
-
[15]
Playground v3: Improving text-to-image alignment with deep-fusion large language models
Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695,
-
[16]
Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in adam. arXiv preprint arXiv:1711.05101,
-
[17]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073,
-
[18]
Transframer: Arbitrary frame prediction with generative models
Charlie Nash, Joao Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494,
-
[19]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741,
-
[20]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,
-
[21]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,
-
[22]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,
-
[23]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 2024a. Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beat...
-
[24]
Visual autoregressive modeling: Scalable image generation via next-scale prediction
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905,
-
[25]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,
-
[26]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024a. Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan ...
-
[27]
Loong: Generating minute-level long videos with autoregressive language models
Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models. arXiv preprint arXiv:2410.02757, 2024b. Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv prepr...
-
[28]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072,
-
[29]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5,
-
[30]
Show-1: Marrying pixel and latent diffusion models for text-to-video generation
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a. Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceeding...
-
[31]
URL https://github.com/hpcaitech/Open-Sora. Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. arXiv preprint arXiv:2406.18583,
-
[32]
APPENDIX
We strictly publish our code and pretrained models to improve interpretability and assure reproducibility. Here, more implementation details and ablation experiments are organized as follows:
• Architecture details of Scaling and Shift layer (Sec. A)
• Normalization configurations (Sec. B)
• Video...
-
[33]
Specifically, we refer to AdaLayerNorm and decompose the motion changes into mean and variance parameters, which are further used to apply the affine transformation on BOV embeddings.
[Figure 11 diagram labels: UpProjector, DownProjector, LayerNorm, Scale, Shift, <BOV> outputs, Temporal outputs, Indicator Tokens]
Figure 11: Scaling and Shift layer. We reformulate cross-frame motion changes by l...
-
[34]
In each video, the temporal layers require only 0.03 seconds, compared to 11.97 seconds for the spatial layers, highlighting the exceptional efficiency of the temporal layers. While NOVA is already efficient in text-to-video generation, there is potential for further acceleration in the spatial layers. Table 4: Inference time analysis for different layer...
-
[35]
While NOVA outperforms most models of comparable size and matches the overall score of state-of-the-art models, we observe that increasing the model scale results in marginal improvements and does not boost the text rendering performance. This limitation may be attributed to our reliance on extensive web datasets, such as LAION and DataComp. In future wo...
-
[36]
NOVA can generate images with a maximum resolution of 1024×1024. Our model excels in the domain of text-to-image generation, producing a vast array of high-quality images that accurately reflect the textual descriptions provided. This capability not only spans a wide range of subjects, from realistic landscapes and portraits to imaginative and abstract c...