Introduces textual belief states and factorized GRPO to enforce strict latent state mediation in text-based world models, yielding preserved prediction accuracy with large gains in representation quality and rollout performance on TextWorld and ScienceWorld.
World Models
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment. An interactive version of this paper is available at https://worldmodels.github.io/
representative citing papers
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
Task structure is identifiable across time steps and task-relevant representations are identifiable within steps in a nonparametric setting under sparsity regularization.
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
AIQI is the first model-free universal AI agent proven asymptotically ε-optimal in general RL, performing induction over distributional Q-functions instead of policies or environments.
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.
MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
Equilibrium World Models are a deep-learning solver that enforces exact equilibrium conditions on broad model-generated state distributions to globally solve dynamic stochastic models featuring rare disasters, binding constraints, and counterfactual states.
Hypernetworks distill modular reservoir connectivity via a genomic bottleneck to generate sparse recurrent networks solving difficult temporal tasks with minimal training and maintained robustness.
VLWMs learn variable-length action-conditioned dynamics in latent space with curriculum training, yielding 13% average gains over prior latent world models on long-horizon tasks.
PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.
VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
JEPA guidance steers diffusion models toward low-density regions under an implicit density from a world model, producing minority samples with improved fidelity and semantic validity over generator-centric baselines.
SliceWorld introduces a world-state model for CT report generation that uses predictive and factor-aware objectives on axial slice sequences.
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
citing papers explorer
-
Valdi: Value Diffusion World Models
Valdi pairs a latent diffusion dynamics model with end-to-end MPC training and reports that one diffusion step matches an MLP baseline on CarRacing while exposing a multimodality-control trade-off.
-
RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail
Exocentric-only LoRA adaptation of Cosmos3-Nano on a new synchronized retail video dataset matches or exceeds combined ego+exo training on most held-out metrics.
-
3D Point World Models: Point Completion Enables More Accurate Dynamics Learning
3DPWM completes partial point clouds then learns dynamics on the completed 3D scenes to produce reliable long-horizon rollouts for model-based robotic planning.
-
Revealing Safety-Critical Scenarios for UTM via Transformer
Transformer RL with a Policy Model and Action Sampler finds UTM safety vulnerabilities 8x more efficiently than expert testing in 700-hour simulations.
-
Self-Evolving World Models for LLM Agent Planning
WorldEvolver uses episodic memory, semantic memory, and selective foresight to self-evolve world models at test time, achieving top prediction accuracy and agent success on ALFWorld and ScienceWorld benchmarks.
-
A Path-Space Formulation of Prediction in World Models: From a Single Action to Prediction, Planning, and Irreversibility
Path-space formulation of world-model prediction via Onsager-Machlup action, with attention-based models acquiring asymmetry proportional to data irreversibility.
-
World Value Models for Robotic Manipulation
World Value Model (WVM) integrates world models with value estimation to achieve SOTA Value-Order Correlation on expert and suboptimal robotic data and improves downstream policy performance.
-
Self-Evolving Cognitive Framework via Causal World Modeling for Embodied Scientific Intelligence
Proposes a self-evolving cognitive framework integrating causal world modeling, intervention-driven reasoning, and continual refinement for embodied scientific intelligence.
-
MV-WAM: Manifold-Aware World Action Model with Value Augmentation
MV-WAM reports 55.7% simulation and 77.5% real-world success rates by aligning heterogeneous visual and action manifolds through causal masking and value-guided rollback.
-
Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning
Extends DAE theory to POMDPs with minimal changes and introduces discrete latent dynamics to cut computational cost, with ALE experiments showing scalability and retained sample efficiency.
-
Physics-IQ Verified
Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.
-
PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation
PAIWorld adds explicit geometric cross-view mechanisms and 3D distillation to DiT world models to achieve multi-view 3D consistency in robotic manipulation benchmarks.
-
BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics
BrainWorld is a structural-prior-conditioned generative model that produces stable whole-brain 4D fMRI trajectories up to 400 frames, augments downstream tasks, and learns transferable multimodal representations across 22 datasets.
-
Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation
A synthetic-data-driven, hierarchy-aware adaptation of foundation models produces geometry-consistent representations that improve pose estimation and monocular depth in endoscopy.
-
Kairos: A Native World Model Stack for Physical AI
Kairos is a native world model stack using cross-embodiment pretraining, hybrid linear temporal attention with theoretical error bounds, and deployment-aware co-design, reporting top performance on embodied benchmarks.
-
How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position
The paper proposes an L0-L7 evidential ladder for evaluating world models in embodied decision-making, prioritizing interventional action fidelity and policy optimization utility over visual plausibility.
-
Some Essential Constructive Foundations for Systems and Control
Develops Bishop-style constructive apparatus for geometric sets, integration, extremum theorems, selectors, differential inclusions, Markov chains, and densities in systems and control.
-
Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks
WorldDP combines a high-level object-centric world model for subgoal planning with a low-level diffusion policy for execution, claiming better performance than baselines on multi-stage robotic manipulation benchmarks.
-
PRISM: PRior-guided Imagination Sampling in world Models
PRISM derives a state-conditioned action prior from a world model's encoder and integrates it into sampling-based planning via product-of-Gaussians fusion, claiming 35 and 32 percentage point gains on Cube and PushT tasks.
-
Instrumented data for causal scientific machine learning
Instrumented data augments observations with mechanistic models, uncertainty, and counterfactuals to enable causal interventions via Pearl's do-operator in scientific machine learning.
-
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
WLA models use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states, reporting SOTA success rates on RoboTwin2.0 and RMBench.
-
Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
MR.Q combines predictive auxiliary tasks with high-capacity value functions in a model-free architecture to achieve strong multitask RL performance without planning.
-
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
MIRAGE compresses explicit chain-of-thought into latent vectors and adds a generative world model to predict future interface states, matching explicit reasoning performance with 3-5x fewer tokens on Android benchmarks.
-
UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation
UniCanvas introduces a diffusion-based approach for unified multimodal generation by embedding text as visual patterns within images on a shared canvas.
-
MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics
Assembles MPM simulation dataset and compares code generation versus video diffusion for inferring physical parameters and extrapolating dynamics from videos.
-
TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications
TERRA formalizes cross-domain transfer for action-conditioned latent predictors via controlled Markov processes and bisimulation metrics, states a falsifiable Structured-State Transfer Hypothesis, and outlines a preregistered experimental program with no empirical results presented.
-
OptiWorld: Optimal Control for Video World Generation under Physical Constraints
OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.
-
Physical Object Understanding with a Physically Controllable World Model
Autoregressive probabilistic world models trained on raw videos yield emergent object segmentation, 3D controllability, and physical relationship inference via multi-future motion correlation analysis.
-
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.
-
Chreode: A Cell World Model for One-Step Temporal Dynamics and Perturbation Prediction
Chreode introduces a pretrained one-step dynamics model using a structured residual operator that improves perturbation prediction transfer from developmental trajectories to CRISPR data.
-
The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology
The Sensation Modulating Network (SMN) is proposed as an embodied architecture in which haltability, arising from opponent dynamics at all scales, supplies the structural basis for object-directed intentionality.
-
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
-
Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations
TC-WM converts foundation-model visual embeddings into parsimonious task-sufficient world model latents via linear projection, contrastive physical-state alignment, and embedding reconstruction, with a theoretical identification guarantee.
-
WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models
WorldCraft introduces NWT, SP-LoRA, and TASP to enable object trajectory control in video-based world models while preserving camera navigation.
-
Reinforcement Learning for Laser Additive Manufacturing Scan-Order Optimisation: A Bilevel Proxy-FEA Diagnostic Framework for Reward and World-Model Diagnosis
A bilevel Proxy-FEA diagnostic framework is introduced and tested on a simplified LDED32 stripe benchmark to reveal proxy misalignment with FEA labels and a stress-distortion trade-off in RL-guided scan-order optimization.
-
LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.
-
ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data
CMWM is a recurrent latent world model for forecasting patient trajectories like annual eGFR in CKD, reporting 7.28% lower MAE than a tuned GPT-5.5 baseline on a 2232-patient cohort with gains from dialogue data.
-
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.
-
Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
A unified system integrating sparse-query 3D Gaussian reconstruction with multi-stage causal video generation for autonomous driving world models.
-
ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
ECG-WM combines ODE physiological priors with latent diffusion models to generate intervention-conditioned ECG trajectories and uses diffusion stochasticity for uncertainty-aware clinical risk assessment.
-
SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation
SWoMo decouples symbolic rule-based motion modeling via scene graphs from visual realism via diffusion models, trained through inverse pairing of real cataract surgery videos reconstructed in the simulator for sim-to-real translation.
-
Mind Dreamer: Untethering Imagination via Active Causal Intervention on Latent Manifolds
Mind Dreamer uses active causal intervention via an adversarial initial-state generator and relay value functions to untether imagination in MBRL, claiming 1.67x average and up to 8.8x sparse-reward speedups over DreamerV3.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher throughput than prior open baselines.
-
PROWL: Prioritized Regret-Driven Optimization for World Model Learning
PROWL introduces a KL-constrained adversarial curriculum and prioritized adversarial trajectory buffer to actively discover and correct rare failure modes in action-conditioned video world models.
-
Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari
Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score of 0.770.
-
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
-
CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
-
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.
-
HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks
HDFlow pairs a high-level diffusion planner for strategic subgoals with a low-level rectified flow planner for efficient trajectories, claiming superior performance on furniture assembly and other long-horizon robotic benchmarks.
-
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
The paper introduces Hamiltonian World Models by encoding observations into structured latent phase space and evolving states via Hamiltonian-inspired dynamics for physically meaningful rollouts in embodied AI.