GAIA-1: A Generative World Model for Autonomous Driving
Pith reviewed 2026-05-12 07:09 UTC · model grok-4.3
The pith
GAIA-1 generates controllable driving scenarios by mapping video, text, and actions to discrete tokens and predicting the next token in sequence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAIA-1 is a generative world model that accepts video, text, and action inputs, discretizes them into tokens, and trains via next-token prediction to synthesize realistic driving scenarios. The resulting model exhibits emergent capabilities: learning of high-level structure and scene dynamics, contextual awareness, generalization, and geometric understanding. It offers fine-grained control over ego-vehicle behavior and scene features while generating samples that reflect learned expectations of future events.
What carries the argument
Discretization of multimodal driving inputs into tokens combined with unsupervised next-token prediction to model evolving world states.
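The paper releases no reference implementation, so the following is a minimal sketch of the sequence-modeling formulation it describes, assuming video frames have already been quantized to discrete token ids by a separate image tokenizer and that action and text tokens share the same vocabulary. The class name `WorldTokenModel`, the vocabulary size, and the interleaving scheme are illustrative, not GAIA-1's actual architecture.

```python
# Minimal sketch of the next-token world-modeling objective (illustrative, not GAIA-1's code).
# Assumes video frames were already quantized to discrete token ids by a separate image
# tokenizer, and actions/text were binned into the same vocabulary; both are hypothetical choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WorldTokenModel(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256, n_heads=8, n_layers=4, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) interleaved image/action/text token ids from a shared vocabulary
        b, s = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(s, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(s).to(tokens.device)
        return self.head(self.backbone(x, mask=mask))

def next_token_loss(model, tokens):
    logits = model(tokens[:, :-1])  # predict token t+1 from tokens up to t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

# Usage: a random batch stands in for tokenized driving sequences.
model = WorldTokenModel()
fake_tokens = torch.randint(0, 1024, (2, 128))
print(next_token_loss(model, fake_tokens).item())
```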
If this is right
- Generated sequences provide diverse synthetic environments for training perception and planning modules.
- Explicit conditioning on actions allows targeted simulation of specific ego-vehicle maneuvers.
- Emergent geometric and dynamic understanding supports more accurate long-horizon forecasting without hand-crafted physics rules.
- Contextual awareness in the token sequences enables generation of coherent multi-agent interactions in complex scenes.
Where Pith is reading between the lines
- The token-sequence formulation could support closed-loop planning by sampling multiple futures and selecting among them (see the sketch after this list).
- Similar discretization and prediction pipelines might transfer to other sensor-rich domains such as robotics or aerial navigation.
- The learned representation of future expectations could be used to detect distribution shifts between simulated and real environments.
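To make the first speculation concrete, here is a rough sketch of planning by sampling: roll out several candidate futures from an autoregressive token model (such as the sketch above) and keep the action sequence whose rollout scores best. `sample_future`, `plan_by_sampling`, and the `score` function are hypothetical stand-ins, not anything described in the paper.

```python
# Sketch of planning by sampling candidate futures from an autoregressive world model.
# A speculative use of the model; the interfaces here are hypothetical stand-ins.
import torch

@torch.no_grad()
def sample_future(model, context_tokens, horizon, temperature=1.0):
    seq = context_tokens.clone()
    for _ in range(horizon):
        logits = model(seq)[:, -1, :] / temperature
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, context_tokens.size(1):]

def plan_by_sampling(model, context_tokens, candidate_actions, horizon, score):
    # Roll out one future per candidate action sequence and keep the best-scoring one.
    best_action, best_value = None, float("-inf")
    for action_tokens in candidate_actions:
        conditioned = torch.cat([context_tokens, action_tokens], dim=1)
        future = sample_future(model, conditioned, horizon)
        value = score(future)  # e.g. progress, comfort, collision penalty
        if value > best_value:
            best_action, best_value = action_tokens, value
    return best_action
```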
Load-bearing premise
Converting continuous video and action streams into discrete tokens and training a next-token predictor will yield generated outputs whose dynamics and geometry align closely enough with real-world driving physics to support safety-critical autonomy training.
What would settle it
A direct comparison of generated scenario statistics, such as vehicle trajectory distributions, collision frequencies, and lane adherence, against matched real-world driving datasets to measure divergence in physical consistency.
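A minimal version of that comparison could look like the sketch below: extract per-scenario summary statistics from matched real and generated trajectory sets and report a distributional distance for each. The statistic choices, array shapes, and the use of the Wasserstein distance are assumptions for illustration.

```python
# Sketch of the proposed check: compare summary statistics of generated vs. real scenarios
# with a distributional distance. The statistic extraction and data arrays are hypothetical.
import numpy as np
from scipy.stats import wasserstein_distance

def scenario_stats(trajectories, dt=0.1):
    """trajectories: array (n_scenarios, timesteps, 2) of ego x/y positions."""
    vel = np.diff(trajectories, axis=1) / dt
    speed = np.linalg.norm(vel, axis=-1)        # (n, T-1)
    accel = np.diff(speed, axis=1) / dt         # (n, T-2)
    return {"mean_speed": speed.mean(axis=1), "max_accel": np.abs(accel).max(axis=1)}

def stat_divergence(real_trajs, generated_trajs):
    real, gen = scenario_stats(real_trajs), scenario_stats(generated_trajs)
    return {k: wasserstein_distance(real[k], gen[k]) for k in real}

# Usage with placeholder random walks standing in for matched real / generated scenario sets.
rng = np.random.default_rng(0)
print(stat_divergence(rng.normal(size=(50, 100, 2)).cumsum(1),
                      rng.normal(size=(50, 100, 2)).cumsum(1)))
```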
Original abstract
Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GAIA-1, a generative world model for autonomous driving that accepts video, text, and action inputs, maps them to discrete tokens, and trains a next-token predictor to generate future driving scenarios. It claims emergent capabilities including high-level scene structure learning, dynamics modeling, contextual awareness, generalization, and geometry understanding, positioning the model as a tool for controllable scenario generation to accelerate autonomy training.
Significance. If the claimed emergent properties are quantitatively verified, the work could meaningfully advance simulation-based training for autonomous vehicles by offering a scalable, multimodal sequence-modeling route to realistic, controllable scene generation without explicit physics or supervision. This would be particularly relevant for exploring rare events and ego-vehicle control in safety-critical domains.
major comments (2)
- [Abstract] The central claim that the model captures scene dynamics, geometry understanding, and contextual awareness is presented as an observed emergent property, yet the text supplies no quantitative metrics, baselines, ablation studies, or held-out scene evaluations to substantiate that the generated outputs respect continuous physical constraints (e.g., non-penetration, realistic accelerations) at a fidelity finer than the token grid.
- [Abstract] The discretization step that maps continuous video and action streams to discrete tokens is load-bearing for all downstream claims; without reported analysis of quantization error accumulation over long horizons or comparisons against continuous baselines, it is unclear whether the next-token predictor can produce physics-faithful trajectories suitable for safety-critical autonomy training.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and have revised the paper to improve clarity and substantiation of the claims.
Point-by-point responses
-
Referee: [Abstract] The central claim that the model captures scene dynamics, geometry understanding, and contextual awareness is presented as an observed emergent property, yet the text supplies no quantitative metrics, baselines, ablation studies, or held-out scene evaluations to substantiate that the generated outputs respect continuous physical constraints (e.g., non-penetration, realistic accelerations) at a fidelity finer than the token grid.
Authors: We agree that the abstract states these emergent properties without quantitative metrics. The full manuscript supports the observations through qualitative results and visualizations in the experiments, where generated sequences demonstrate coherent scene evolution, object interactions, and geometric consistency. We have revised the abstract to clarify that these properties are demonstrated via the generated outputs and to reference the relevant experimental sections and figures. We acknowledge the value of additional quantitative metrics for physics fidelity and have added a new paragraph in the results discussion that reports proxy measures such as trajectory smoothness and collision avoidance rates computed on held-out generations. revision: yes
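The proxy measures named in this response are simple to state; a sketch of two of them, under assumed array shapes and a hand-picked collision threshold, follows.

```python
# Sketch of the proxy metrics mentioned in the rebuttal: trajectory smoothness (mean jerk)
# and a simple pairwise-distance collision rate. Thresholds and array shapes are assumptions.
import numpy as np

def mean_jerk(positions, dt=0.1):
    """positions: (timesteps, 2) ego positions; lower is smoother."""
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=-1).mean())

def collision_rate(agent_positions, min_gap=2.0):
    """agent_positions: (timesteps, n_agents, 2); fraction of frames with any pair closer than min_gap."""
    hits = 0
    for frame in agent_positions:
        d = np.linalg.norm(frame[:, None, :] - frame[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)   # ignore each agent's distance to itself
        hits += int((d < min_gap).any())
    return hits / len(agent_positions)
```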
-
Referee: [Abstract] The discretization step that maps continuous video and action streams to discrete tokens is load-bearing for all downstream claims; without reported analysis of quantization error accumulation over long horizons or comparisons against continuous baselines, it is unclear whether the next-token predictor can produce physics-faithful trajectories suitable for safety-critical autonomy training.
Authors: We concur that the tokenization is central to the approach. The methods section details the discretization process, and the results include long-horizon generations that remain visually consistent without prominent quantization artifacts. In the revised manuscript we have added a dedicated analysis subsection examining error accumulation over extended sequences using reconstruction metrics on tokenized video. Direct side-by-side comparisons to continuous baselines are not feasible within the current architecture, but we have expanded the discussion to explain the scalability and controllability benefits of the discrete formulation while noting its limitations for sub-token physical precision. revision: partial
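The error-accumulation analysis described here could be reported as a per-horizon reconstruction curve; the sketch below assumes hypothetical `detokenize` and `rollout_tokens` interfaces rather than the paper's actual tokenizer or sampler.

```python
# Sketch of the error-accumulation analysis described in the rebuttal: a per-horizon
# reconstruction metric between rolled-out frames and ground truth. `detokenize` and
# `rollout_tokens` are hypothetical interfaces, not the paper's tokenizer or sampler.
import numpy as np

def psnr(a, b, max_val=1.0):
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def horizon_error_curve(ground_truth_frames, rollout_tokens, detokenize):
    """ground_truth_frames: list of (H, W, 3) arrays aligned with the rollout horizon."""
    curve = []
    for t, tokens_t in enumerate(rollout_tokens):
        recon = detokenize(tokens_t)          # decoded frame at horizon step t
        curve.append(psnr(ground_truth_frames[t], recon))
    return curve  # a downward trend indicates error accumulating with horizon length
```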
Circularity Check
No circularity: GAIA-1 claims rest on standard next-token prediction trained on external data
full rationale
The paper frames world modeling as mapping video/action/text inputs to discrete tokens and training an unsupervised next-token predictor. Emergent properties (scene dynamics, geometry understanding, contextual awareness) are presented as observed outcomes after training on real driving data and evaluation on held-out scenes. No equations, fitted parameters, or self-citations reduce these properties to quantities defined by construction within the paper itself. The approach follows the standard autoregressive generative modeling paradigm without self-referential derivation or load-bearing uniqueness theorems imported from the authors' prior work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Continuous video and action streams can be mapped, essentially without loss, to a discrete token vocabulary that preserves their semantic and dynamic information (made concrete in the sketch below).
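To make the ledger's assumption concrete, the sketch below shows the simplest form of the discretization it presupposes: nearest-codebook quantization, where any structure the codebook cannot represent is discarded. Codebook size and feature dimensions are illustrative.

```python
# Minimal nearest-codebook quantization: continuous features are snapped to a finite
# vocabulary, and whatever the codebook cannot represent is lost. Sizes are illustrative.
import numpy as np

def quantize(features, codebook):
    """features: (n, d) continuous vectors; codebook: (k, d). Returns token ids and reconstruction."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)  # (n, k)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
codebook = rng.normal(size=(512, 16))
ids, recon = quantize(feats, codebook)
print(ids[:4], float(np.mean((feats - recon) ** 2)))   # nonzero error = information discarded
```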
Forward citations
Cited by 42 Pith papers
-
Coding Agent Is Good As World Simulator
A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
TIE: Time Interval Encoding for Video Generation over Events
TIE derives a sinc-based interval encoding from temporal integrability and duration invariance principles, raising temporal constraint satisfaction from 77% to 96% on the OmniEvents dataset while preserving visual quality.
-
Latent State Design for World Models under Sufficiency Constraints
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
-
ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation
ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
-
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
-
Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
-
The DAWN of World-Action Interactive Models
DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
-
Network-Efficient World Model Token Streaming
An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...
-
DriveFuture: Future-Aware Latent World Models for Autonomous Driving
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.
-
GEM: Generating LiDAR World Model via Deformable Mamba
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
HERMES++ unifies 3D scene understanding and future geometry prediction in driving scenes via BEV representations, LLM-enhanced queries, a temporal link, and joint geometric optimization.
-
LA-Pose: Latent Action Pretraining Meets Pose Estimation
LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
-
ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
-
Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.
-
MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
-
LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving
LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.
-
Hierarchical Planning with Latent World Models
Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
-
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
-
Safety, Security, and Cognitive Risks in World Models
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Lifting Embodied World Models for Planning and Control
Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...
-
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
-
Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.
-
DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.