pith. machine review for the scientific record.

arxiv: 2309.17080 · v1 · submitted 2023-09-29 · 💻 cs.CV · cs.AI · cs.RO

Recognition: 2 theorem links · Lean Theorem

GAIA-1: A Generative World Model for Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 07:09 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords generative world model · autonomous driving · sequence modeling · discrete tokens · video generation · scene dynamics · ego-vehicle control

The pith

GAIA-1 generates controllable driving scenarios by mapping video, text, and actions to discrete tokens and predicting the next token in sequence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GAIA-1 to address the challenge of predicting possible outcomes in autonomous driving by treating world modeling as an unsupervised sequence modeling task. Inputs from video, text, and vehicle actions are converted into discrete tokens, after which the model learns to forecast the subsequent tokens to produce future scenes. This method yields emergent capabilities including recognition of high-level scene structures, motion dynamics, contextual understanding, generalization across situations, and geometric relations. If the approach holds, it would enable generation of diverse, realistic driving futures under explicit control of the ego-vehicle path and environment elements, opening routes to safer and faster training of autonomy systems through simulation.

Core claim

GAIA-1 is a generative world model that accepts video, text, and action inputs, discretizes them into tokens, and trains via next-token prediction to synthesize realistic driving scenarios. The resulting model exhibits emergent properties of high-level structure learning, scene dynamics, contextual awareness, generalization, and geometric understanding. These properties support fine-grained control over ego-vehicle behavior and scene features while generating samples that reflect expectations of future events.

What carries the argument

Discretization of multimodal driving inputs into tokens combined with unsupervised next-token prediction to model evolving world states.
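The mechanism is compact enough to sketch. Below is a deliberately tiny stand-in, assuming a uniform-binning tokenizer and a bigram next-token model in place of the paper's learned VQ tokenizer and transformer; every name, size, and signal here is illustrative, not GAIA-1's.

```python
import numpy as np

VOCAB = 32  # hypothetical token vocabulary size

def tokenize(signal, lo=-1.0, hi=1.0, vocab=VOCAB):
    """Discretize a continuous stream into integer tokens by uniform binning."""
    edges = np.linspace(lo, hi, vocab + 1)[1:-1]
    return np.digitize(signal, edges)

def detokenize(tokens, lo=-1.0, hi=1.0, vocab=VOCAB):
    """Map tokens back to bin centers (the quantized reconstruction)."""
    centers = lo + (np.arange(vocab) + 0.5) * (hi - lo) / vocab
    return centers[tokens]

# "Training data": a smooth driving-like signal (e.g., steering angle over time).
t = np.linspace(0, 8 * np.pi, 4000)
tokens = tokenize(0.8 * np.sin(t))

# Next-token predictor: a smoothed bigram count model stands in for the transformer.
counts = np.ones((VOCAB, VOCAB))
for a, b in zip(tokens[:-1], tokens[1:]):
    counts[a, b] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# Generate a continuation by repeatedly sampling the next token.
rng = np.random.default_rng(0)
seq = [tokens[-1]]
for _ in range(50):
    seq.append(rng.choice(VOCAB, p=probs[seq[-1]]))
future = detokenize(np.array(seq))
print(future[:5])
```

The same tokenize-then-autoregress loop applies unchanged whether the tokens come from video frames, text, or actions, which is the unification the paper trades on.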

If this is right

  • Generated sequences provide diverse synthetic environments for training perception and planning modules.
  • Explicit conditioning on actions allows targeted simulation of specific ego-vehicle maneuvers.
  • Emergent geometric and dynamic understanding supports more accurate long-horizon forecasting without hand-crafted physics rules.
  • Contextual awareness in the token sequences enables generation of coherent multi-agent interactions in complex scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The token-sequence formulation could support closed-loop planning by sampling multiple futures and selecting among them.
  • Similar discretization and prediction pipelines might transfer to other sensor-rich domains such as robotics or aerial navigation.
  • The learned representation of future expectations could be used to detect distribution shifts between simulated and real environments.

Load-bearing premise

Converting continuous video and action streams into discrete tokens and training a next-token predictor will yield generated outputs whose dynamics and geometry align closely enough with real-world driving physics to support safety-critical autonomy training.

What would settle it

A direct comparison of generated scenario statistics, such as vehicle trajectory distributions, collision frequencies, and lane adherence, against matched real-world driving datasets to measure divergence in physical consistency.
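One concrete form that test could take, sketched here with synthetic placeholder data; the distributions, bin edges, and the Jensen-Shannon choice are all assumptions, not drawn from the paper.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two histograms (natural log)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(1)
real_speeds = rng.normal(13.0, 3.0, 10_000)  # m/s, stand-in for logged driving data
gen_speeds = rng.normal(13.5, 3.5, 10_000)   # stand-in for model rollouts

bins = np.linspace(0, 30, 61)
p, _ = np.histogram(real_speeds, bins=bins)
q, _ = np.histogram(gen_speeds, bins=bins)
print(f"JS divergence: {js_divergence(p.astype(float), q.astype(float)):.4f}")
```

The same comparison would run over collision frequencies and lane-adherence rates; a divergence that grows with rollout horizon would directly probe the physical-consistency question.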

read the original abstract

Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces GAIA-1, a generative world model for autonomous driving that accepts video, text, and action inputs, maps them to discrete tokens, and trains a next-token predictor to generate future driving scenarios. It claims emergent capabilities including high-level scene structure learning, dynamics modeling, contextual awareness, generalization, and geometry understanding, positioning the model as a tool for controllable scenario generation to accelerate autonomy training.

Significance. If the claimed emergent properties are quantitatively verified, the work could meaningfully advance simulation-based training for autonomous vehicles by offering a scalable, multimodal sequence-modeling route to realistic, controllable scene generation without explicit physics or supervision. This would be particularly relevant for exploring rare events and ego-vehicle control in safety-critical domains.

major comments (2)
  1. [Abstract] The central claim that the model captures scene dynamics, geometry understanding, and contextual awareness is presented as an observed emergent property, yet the text supplies no quantitative metrics, baselines, ablation studies, or held-out scene evaluations to substantiate that the generated outputs respect continuous physical constraints (e.g., non-penetration, realistic accelerations) at a fidelity finer than the token grid.
  2. [Abstract] The discretization step that maps continuous video and action streams to discrete tokens is load-bearing for all downstream claims; without reported analysis of quantization error accumulation over long horizons or comparisons against continuous baselines, it is unclear whether the next-token predictor can produce physics-faithful trajectories suitable for safety-critical autonomy training.
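The concern in point 2 can be made concrete with a toy rollout: even when the dynamics model is exact, re-quantizing the state at every step injects error into the trajectory. The rotation dynamics and grid step below are illustrative assumptions, not the paper's model.

```python
import numpy as np

def quantize(x, step=0.05):
    """Snap each coordinate to a uniform grid (stand-in for a token codebook)."""
    return np.round(x / step) * step

def dynamics(x, theta=0.05):
    """Energy-preserving rotation standing in for scene dynamics."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

x_true = np.array([1.0, 0.0])
x_quant = x_true.copy()
drift = []
for _ in range(200):
    x_true = dynamics(x_true)              # exact rollout
    x_quant = quantize(dynamics(x_quant))  # re-quantized at every step
    drift.append(float(np.linalg.norm(x_true - x_quant)))

print(f"drift at step 10:  {drift[9]:.4f}")
print(f"drift at step 200: {drift[-1]:.4f}")
```

The analysis the referee asks for amounts to measuring this drift curve for the actual tokenizer over the paper's generation horizons, against a continuous baseline.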

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and have revised the paper to improve clarity and substantiation of the claims.

read point-by-point responses
  1. Referee: [Abstract] The central claim that the model captures scene dynamics, geometry understanding, and contextual awareness is presented as an observed emergent property, yet the text supplies no quantitative metrics, baselines, ablation studies, or held-out scene evaluations to substantiate that the generated outputs respect continuous physical constraints (e.g., non-penetration, realistic accelerations) at a fidelity finer than the token grid.

    Authors: We agree that the abstract states these emergent properties without quantitative metrics. The full manuscript supports the observations through qualitative results and visualizations in the experiments, where generated sequences demonstrate coherent scene evolution, object interactions, and geometric consistency. We have revised the abstract to clarify that these properties are demonstrated via the generated outputs and to reference the relevant experimental sections and figures. We acknowledge the value of additional quantitative metrics for physics fidelity and have added a new paragraph in the results discussion that reports proxy measures such as trajectory smoothness and collision avoidance rates computed on held-out generations. revision: yes

  2. Referee: [Abstract] The discretization step that maps continuous video and action streams to discrete tokens is load-bearing for all downstream claims; without reported analysis of quantization error accumulation over long horizons or comparisons against continuous baselines, it is unclear whether the next-token predictor can produce physics-faithful trajectories suitable for safety-critical autonomy training.

    Authors: We concur that the tokenization is central to the approach. The methods section details the discretization process, and the results include long-horizon generations that remain visually consistent without prominent quantization artifacts. In the revised manuscript we have added a dedicated analysis subsection examining error accumulation over extended sequences using reconstruction metrics on tokenized video. Direct side-by-side comparisons to continuous baselines are not feasible within the current architecture, but we have expanded the discussion to explain the scalability and controllability benefits of the discrete formulation while noting its limitations for sub-token physical precision. revision: partial

Circularity Check

0 steps flagged

No circularity: GAIA-1 claims rest on standard next-token prediction trained on external data

full rationale

The paper frames world modeling as mapping video/action/text inputs to discrete tokens and training an unsupervised next-token predictor. Emergent properties (scene dynamics, geometry understanding, contextual awareness) are presented as observed outcomes after training on real driving data and evaluation on held-out scenes. No equations, fitted parameters, or self-citations reduce these properties to quantities defined by construction within the paper itself. The approach follows the standard autoregressive generative modeling paradigm without self-referential derivation or load-bearing uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that real-world driving dynamics can be captured by next-token prediction over discrete tokens derived from video, text, and actions; no new physical entities or free parameters are introduced in the abstract.

axioms (1)
  • domain assumption: Continuous video and action streams can be losslessly mapped to a discrete token vocabulary that preserves semantic and dynamic information.
    The approach explicitly maps inputs to discrete tokens before sequence modeling.
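A quick numeric check on that "losslessly" qualifier, assuming a uniform scalar quantizer (a learned VQ codebook behaves analogously within each codeword cell): a finite vocabulary bounds, but never eliminates, reconstruction error on a continuous stream.

```python
import numpy as np

def quantize(x, vocab, lo=-1.0, hi=1.0):
    """Map continuous values in [lo, hi] to the centers of `vocab` uniform bins."""
    idx = np.clip(((x - lo) / (hi - lo) * vocab).astype(int), 0, vocab - 1)
    return lo + (idx + 0.5) * (hi - lo) / vocab

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 100_000)
for vocab in (64, 1024, 16384):
    err = np.max(np.abs(x - quantize(x, vocab)))
    print(f"vocab={vocab:6d}  worst-case reconstruction error {err:.6f}  (bound {1.0 / vocab:.6f})")
```

The error shrinks as the vocabulary grows but is floored at half a bin width, which is why the axiom is better read as "mapped with tolerable loss" than as lossless.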

pith-pipeline@v0.9.0 · 5496 in / 1186 out tokens · 57244 ms · 2026-05-12T07:09:42.173438+00:00 · methodology

discussion (0)


Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coding Agent Is Good As World Simulator

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent framework generates and refines executable physics simulation code from prompts to create world models that enforce physical constraints, claiming superior accuracy and fidelity over video-based alternatives.

  2. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  3. TIE: Time Interval Encoding for Video Generation over Events

    cs.CV 2026-05 unverdicted novelty 7.0

    TIE derives a sinc-based interval encoding from temporal integrability and duration invariance principles, raising temporal constraint satisfaction from 77% to 96% on the OmniEvents dataset while preserving visual quality.

  4. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  5. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  6. ScenarioControl: Vision-Language Controllable Vectorized Latent Scenario Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    ScenarioControl introduces the first vision-language controllable generator for realistic vectorized 3D driving scenarios with temporal consistency across actor views.

  7. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  8. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV 2026-04 conditional novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  9. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  10. The DAWN of World-Action Interactive Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

  11. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  12. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  13. Network-Efficient World Model Token Streaming

    cs.RO 2026-05 unverdicted novelty 6.0

    An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...

  14. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

  15. GEM: Generating LiDAR World Model via Deformable Mamba

    cs.CV 2026-05 unverdicted novelty 6.0

    GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.

  16. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  17. HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    HERMES++ unifies 3D scene understanding and future geometry prediction in driving scenes via BEV representations, LLM-enhanced queries, a temporal link, and joint geometric optimization.

  18. LA-Pose: Latent Action Pretraining Meets Pose Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...

  19. ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution

    cs.RO 2026-04 unverdicted novelty 6.0

    ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.

  20. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  21. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  22. Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

    cs.CV 2026-04 unverdicted novelty 6.0

    Infrastructure-centric world models use roadside sensors' temporal depth to complement vehicle spatial breadth for better traffic simulation and prediction.

  23. MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.

  24. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  25. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  26. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  27. LMGenDrive: Bridging Multimodal Understanding and Generative World Modeling for End-to-End Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    LMGenDrive unifies LLM-based multimodal understanding with generative world models to output both future driving videos and control signals for end-to-end closed-loop autonomous driving.

  28. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  29. ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 6.0

    ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

  30. Safety, Security, and Cognitive Risks in World Models

    cs.CR 2026-04 unverdicted novelty 6.0

    World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

  31. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  32. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  33. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  34. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  35. Lifting Embodied World Models for Planning and Control

    cs.CV 2026-04 unverdicted novelty 5.0

    Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-eff...

  36. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  37. Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic

    cs.AI 2026-04 unverdicted novelty 5.0

    This survey synthesizes AI techniques for mixed autonomy traffic simulation and introduces a taxonomy spanning agent-level behavior models, environment-level methods, and cognitive/physics-informed approaches.

  38. DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

    cs.CV 2026-03 unverdicted novelty 5.0

    DynFlowDrive models action-conditioned scene transitions via rectified flow in latent space and adds stability-aware trajectory selection, showing gains on nuScenes and NavSim without added inference cost.

  39. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

  40. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  41. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

  42. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 40 Pith papers

  1. [1]

    Kendall, J

    A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V .-D. Lam, A. Bewley, and A. Shah. Learning to drive in a day. In Proceedings of the International Conference on Robotics and Automation (ICRA), 2019

  2. [2]

    Caesar, V

    H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2020

  3. [3]

    A. Hu, F. Cotter, N. Mohan, C. Gurau, and A. Kendall. Probabilistic Future Prediction for Video Scene Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2020

  4. [4]

    Ettinger, S

    S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. Qi, Y . Zhou, Z. Yang, A. Chouard, P. Sun, J. Ngiam, V . Vasudevan, A. McCauley, J. Shlens, and D. Anguelov. Large Scale Interactive Motion Forecasting for Autonomous Driving : The Waymo Open Motion Dataset. In Proceedings of the IEEE International Conference on Computer V...

  5. [5]

    A. Hu, Z. Murez, N. Mohan, S. Dudas, J. Hawke, V . Badrinarayanan, R. Cipolla, and A. Kendall. FIERY: Future Instance Prediction in Bird’s-Eye View From Surround Monocular Cameras. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) , pages 15273–15282, 2021

  6. [6]

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li. Planning-oriented autonomous driving. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2023

  7. [7]

    Ha and J

    D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS) , 2018

  8. [8]

    Y . LeCun. A Path Towards Autonomous Machine Intelligence. In arXiv preprint, 2022

  9. [9]

    Schrittwieser, I

    J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lock- hart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. In Nature, 2020

  10. [10]

    Janner, Q

    M. Janner, Q. Li, and S. Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems (NeurIPS) , 2021

  11. [11]

    A. Hu, G. Corrado, N. Griffiths, Z. Murez, C. Gurau, H. Yeo, A. Kendall, R. Cipolla, and J. Shotton. Model-Based Imitation Learning for Urban Driving. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  12. [12]

    Micheli, E

    V . Micheli, E. Alonso, and F. Fleuret. Transformers are sample-efficient world models. In Proceedings of the International Conference on Learning Representations (ICLR) , 2023

  13. [13]

    Hafner, J

    D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. In arXiv preprint, 2023

  14. [14]

    S. Reed, K. Zolna, E. Parisotto, S. Gómez, A. Novikov, G. Barth-Maron, M. Gimenez, Y . Sulsky, J. Kay, J. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y . Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. Freitas. A generalist agent. In Transactions on Machine Learning Research (TMLR), 2022

  15. [15]

    P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg. Daydreamer: World models for physical robot learning. In Proceedings of the Conference on Robot Learning (CoRL) , 2023

  16. [16]

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans. Imagen video: High definition video generation with diffusion models. In arXiv preprint, 2022

  17. [17]

    Harvey, S

    W. Harvey, S. Naderiparizi, V . Masrani, C. Weilbach, and F. Wood. Flexible diffusion modeling of long videos. In Advances in Neural Information Processing Systems (NeurIPS) , 2022. 20

  18. [18]

    Esser, J

    P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis. Structure and content-guided video synthesis with diffusion models. In arXiv preprint, 2023

  19. [19]

    L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. InAdvances in Neural Information Processing Systems (NeurIPS) , 2021

  20. [20]

    Smith, M

    S. Smith, M. M. A. Patwary, B. Norick, P. Legresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V . Korthikanti, E. Zhang, R. Child, R. Aminabadi, J. Bernauer, X. Song, M. Shoeybi, Y . He, M. Houston, S. Tiwary, and B. Catanzaro. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. In ...

  21. [21]

    Chowdhery, S

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y . Tay, N. M. Shazeer, V . Prabhakaran, E. Reif, N. Du, B. C. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S...

  22. [22]

    Touvron, T

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models. In arXiv preprint, 2023

  23. [23]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. In arXiv preprint, 2023

  24. [24]

    Raffel, N

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020

  25. [25]

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint, 2019

  26. [26]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...

  27. [27]

    Hoffmann, S

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre. An empirical analysis of compute-optimal large language model training. In Ad...

  28. [28]

    A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  29. [29]

    Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei. BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers. arXiv preprint, 2022

  30. [30]

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021

  31. [31]

    O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015

  32. [32]

    J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2016

  33. [33]

    P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  34. [34]

    J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu. Vector-quantized image modeling with improved VQGAN. In Proceedings of the International Conference on Learning Representations (ICLR), 2022

  35. [35]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  36. [36]

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  37. [37]

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  38. [38]

    J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. arXiv preprint, 2022

  39. [39]

    T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022

  40. [40]

    I. Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), 2019

  41. [41]

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  42. [42]

    J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020

  43. [43]

    E. Hoogeboom, J. Heek, and T. Salimans. simple diffusion: End-to-end diffusion for high resolution images. In Proceedings of the International Conference on Machine Learning (ICML), 2023

  44. [44]

    A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations (ICLR), 2020

  45. [45]

    J. Ho and T. Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022

  46. [46]

    H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint, 2023

  47. [47]

    Negative prompt. https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Negative-prompt, 2022

  48. [48]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations (ICLR), 2021

  49. [49]

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint, 2020

  50. [50]

    D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014

  51. [51]

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014

  52. [52]

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), 2015

  53. [53]

    A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems (NeurIPS), 2016

  54. [54]

    M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In Proceedings of the International Conference on Learning Representations (ICLR), 2018

  55. [55]

    E. Denton and R. Fergus. Stochastic Video Generation with a Learned Prior. In Proceedings of the International Conference on Machine Learning (ICML), 2018

  56. [56]

    R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. Le, and H. Lee. High fidelity video prediction with large stochastic recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019

  57. [57]

    J.-Y. Franceschi, E. Delasalles, M. Chen, S. Lamprier, and P. Gallinari. Stochastic latent residual video prediction. In Proceedings of the International Conference on Machine Learning (ICML), 2020

  58. [58]

    M. Babaeizadeh, M. Saffar, S. Nair, S. Levine, C. Finn, and D. Erhan. FitVid: Overfitting in pixel-level video prediction. arXiv preprint, 2021

  59. [59]

    C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems (NeurIPS), 2016

  60. [60]

    S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  61. [61]

    A. Clark, J. Donahue, and K. Simonyan. Adversarial Video Generation on Complex Datasets. arXiv preprint, 2019

  62. [62]

    S. W. Kim, J. Philion, A. Torralba, and S. Fidler. DriveGAN: Towards a controllable high-quality neural simulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  63. [63]

    I. Skorokhodov, S. Tulyakov, and M. Elhoseiny. StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  64. [64]

    T. Brooks, J. Hellsten, M. Aittala, T.-C. Wang, T. Aila, J. Lehtinen, M.-Y. Liu, A. Efros, and T. Karras. Generating long videos of dynamic scenes. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  65. [65]

    I. Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv preprint, 2016

  66. [66]

    V. Voleti, A. Jolicoeur-Martineau, and C. Pal. MCVD: Masked conditional video diffusion for prediction, generation, and interpolation. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  67. [67]

    T. Höppe, A. Mehrjou, S. Bauer, D. Nielsen, and A. Dittadi. Diffusion models for video prediction and infilling. Transactions on Machine Learning Research (TMLR), 2022

  68. [68]

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint, 2022

  69. [69]

    E. Molad, E. Horwitz, D. Valevski, A. R. Acha, Y. Matias, Y. Pritch, Y. Leviathan, and Y. Hoshen. Dreamix: Video diffusion models are general video editors. arXiv preprint, 2023

  70. [70]

    D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng. MagicVideo: Efficient video generation with latent diffusion models. arXiv preprint, 2022

  71. [71]

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  72. [72]

    N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video Pixel Networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017

  73. [73]

    D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. In Proceedings of the International Conference on Learning Representations (ICLR), 2020

  74. [74]

    W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint, 2021

  75. [75]

    G. L. Moing, J. Ponce, and C. Schmid. CCVS: Context-aware controllable video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  76. [76]

    S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic VQGAN and time-sensitive transformer. In Proceedings of the European Conference on Computer Vision (ECCV), 2022

  77. [77]

    Y. Seo, K. Lee, F. Liu, S. James, and P. Abbeel. HARP: Autoregressive latent video prediction with high-fidelity image generator. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2022

  78. [78]

    W. Yan, D. Hafner, S. James, and P. Abbeel. Temporally consistent transformers for video generation. In Proceedings of the International Conference on Machine Learning (ICML), 2023

  79. [79]

    R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan. Phenaki: Variable length video generation from open domain textual description. In Proceedings of the International Conference on Learning Representations (ICLR), 2023

  80. [80]

    L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, and L. Jiang. MAGVIT: Masked Generative Video Transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
