Introduces textual belief states and factorized GRPO to enforce strict latent state mediation in text-based world models, yielding preserved prediction accuracy with large gains in representation quality and rollout performance on TextWorld and ScienceWorld.
World Models
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment. An interactive version of this paper is available at https://worldmodels.github.io/
representative citing papers
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
Task structure is identifiable across time steps and task-relevant representations are identifiable within steps in a nonparametric setting under sparsity regularization.
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
AIQI is the first model-free universal AI agent proven asymptotically ε-optimal in general RL; it performs induction over distributional Q-functions rather than over policies or environments.
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.
MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
Equilibrium World Models are a deep-learning solver that enforces exact equilibrium conditions on broad model-generated state distributions to globally solve dynamic stochastic models featuring rare disasters, binding constraints, and counterfactual states.
Hypernetworks distill modular reservoir connectivity via a genomic bottleneck to generate sparse recurrent networks solving difficult temporal tasks with minimal training and maintained robustness.
VLWMs learn variable-length action-conditioned dynamics in latent space with curriculum training, yielding 13% average gains over prior latent world models on long-horizon tasks.
PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.
VISA improves closed-set 3D occupancy mIoU on nuScenes by using VLM instance audits as reliability-weighted semantic supervisors during training of existing world models.
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
Dream.exe evaluates 8 video generation models on 101 manipulation tasks by converting generated videos into executable robot trajectories in a simulator, finding measurable success rates that visual metrics do not predict.
SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
JEPA guidance steers diffusion models toward low-density regions under an implicit density from a world model, producing minority samples with improved fidelity and semantic validity over generator-centric baselines.
SliceWorld introduces a world-state model for CT report generation that uses predictive and factor-aware objectives on axial slice sequences.
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
citing papers explorer
-
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
The paper introduces Hamiltonian World Models by encoding observations into structured latent phase space and evolving states via Hamiltonian-inspired dynamics for physically meaningful rollouts in embodied AI.
-
SPLICE: Latent Diffusion over JEPA Embeddings for Conformal Time-Series Inpainting
SPLICE couples JEPA-based latent diffusion with adaptive conformal inference to deliver accurate time-series inpainting with 93-95% empirical coverage on load datasets.
-
Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents
Intent compilation turns vague human goals into verifiable artifacts, using closure-gap vectors and delegation envelopes to separate open-world agent challenges from closed-world solvers and to benchmark closure fixes against extra search.
-
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and unpacking.
-
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.
-
Neural Computers
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives from traces.
-
Designing Digital Humans with Ambient Intelligence
Integrating ambient intelligence with digital humans creates context-aware virtual agents capable of anticipatory assistance based on the user's surroundings.
-
A Model of Understanding in Deep Learning Systems
Deep learning systems achieve systematic understanding through internal models tracking regularities but exhibit fractured understanding due to symbolic misalignment, lack of explicit reduction, and weak unification.
-
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.
-
Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
GRWM uses temporal contrastive learning to geometrically regularize latent spaces in world models for high-fidelity cloning of deterministic 3D worlds.
-
Bio-Inspired Topological Autonomous Navigation with Active Inference in Robotics
An active-inference agent builds real-time topological maps and plans adaptive trajectories for exploration and goal-reaching in robotics without pre-training.
-
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility
Language models encode modal categories via linear difference vectors in their activations that predict fine-grained human plausibility judgments better than prior reports suggested.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
EvolvingAgent: Curriculum Self-evolving Agent with Continual World Model for Long-Horizon Tasks
EvolvingAgent autonomously completes long-horizon tasks via a closed-loop planner-controller-reflector system with continual world model updates, reporting 111.74% higher success rates than baselines in Minecraft and human-level Atari performance.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
World Model on Million-Length Video And Language With Blockwise RingAttention
Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
-
Supervise Thyself: Examining Self-Supervised Representations in Interactive Environments
Empirical comparison finds that self-supervised representations vary in capturing agent state and generalizing to new levels or textures depending on environment visuals and dynamics.
-
DynoPlan: Combining Motion Planning and Deep Neural Network based Controllers for Safe HRL
DynoPlan adds dynamics models and a demonstration-derived heuristic to the options framework so that hierarchical RL can switch between motion planning and DNN controllers via short-horizon model-predictive evaluation.
-
Shaping Belief States with Generative Environment Models for RL
Multi-step predictive generative models form stable belief states capturing environment layout and agent pose, yielding higher data efficiency on RL tasks than model-free agents.
-
Bridge-WA: Predicting Where and How the World Changes for Robotic Action
Bridge-WA introduces a lightweight distillation-based world-action model that uses future-change priors to improve robotic task success and robustness without deployment-time dense rollouts.
-
Evolving Intelligent Complex Systems via Intellicise Networks: Architecture, Technologies, and Pathways
Proposes a cross-layer intellicise network architecture grounded in multiple theories to support intelligent complex systems, with reviews of enabling technologies and a case study.
-
Understanding Rollout Error in Graph World Models
Develops graph rollout bounds separating topology and model error sources and proposes Error-Aware GWM with spectral regularization and consistency terms for dynamic graphs.
-
Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling
A cost-aware selective inference framework combines a lightweight multimodal student model and driver-state world modeling to reduce unsafe false negatives in driver monitoring while keeping low latency.
-
A Compositional Framework for Open-ended Intelligence
Open-ended intelligence is formalized as the compositional closure L(P,C) of primitives P under operators C, with next primitive prediction proposed as an objective to acquire reusable primitives and grammar for lifelong adaptation.
-
Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance
Proposes the VER framework as a diagnostic sequence for identifying explanatory insufficiency in learned representations, distinguishing it from standard errors and shifts.
-
EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence
EWAM adds four integrated neural modules for inference-time co-reasoning and anomaly handling in a frozen Cosmos3 backbone to reduce deployment data needs under zero-shot protocols.
-
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.
-
Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models
TBER describes representational emergence as a five-stage bootstrap process triggered by explanatory insufficiency in AI, biology, and science.
-
Towards World Models in Biomedical Research
Proposes biomedical world models that learn latent states and intervention-conditioned dynamics to enable simulation of future biological trajectories for discovery in virtual cells, organoids, patients, and surgery.
-
Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning
The work introduces behavior-invariant latent task representations via information-theoretic learning in a Transformer world model plus conservative penalties on imagined rollouts to improve generalization in offline meta-RL.
-
Toward AI That Understands Self and Others: A World-Model Theory of Cognitive Diversity and Alignment
The paper introduces the Multi-Phase Inference Assumption and Mechanism to frame cognitive diversity as arising from constrained construction of sufficient statistics and defines alignment as processability between heterogeneous world models via alignment maps and transformation loss.
-
Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization
AMRS deploys a rollout-based causal transformer world model for offline DPO-based affective music recommendation under cold-start conditions on health platforms.
-
Can Predicted Dynamics Exist in the Physical World?
Physical admissibility is defined as a prediction-control interface using kinematic, dynamic, and composed-horizon conditions to reject invalid dynamics proposals, with AUC 0.957 on LeRobot PushT and 87-89% prevention of invalid actions in interventions.
-
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus high map self-consistency.
-
EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control
EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.
-
Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
SepsisAgent is a world-model-augmented LLM agent trained via supervised fine-tuning, behavior cloning, and agentic RL that outperforms RL and LLM baselines on MIMIC-IV sepsis trajectories in off-policy value and safety metrics.
-
Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform
In the Flux environment, RL agents with explicit latent state access achieve ~79% win rate versus ~11% for LLMs on long-horizon tasks, illustrating limitations of sequence prediction for dynamic reasoning.
-
Position: agentic AI orchestration should be Bayes-consistent
Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.
-
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
-
The Global Neural World Model: Spatially Grounded Discrete Topologies for Action-Conditioned Planning
GNWM maps environments to a discrete 2D grid with snapping to stabilize autoregressive planning and learns generalized dynamics from maximum-entropy random walks.
-
Dyadic Partnership(DP): A Missing Link Towards Full Autonomy in Medical Robotics
The paper introduces Dyadic Partnership (DP) as an intermediate paradigm for robot-clinician collaboration that uses foundation models and multi-modal interfaces to enable safer gradual progress toward autonomous medical robotics.
-
Event-Centric World Modeling with Memory-Augmented Retrieval for Embodied Decision-Making
An event-centric framework encodes environments as semantic events and retrieves weighted prior maneuvers from a knowledge bank to enable interpretable, physics-aware decision-making for UAVs.
-
Advancing Open-source World Models
LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity with latency under one second.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
-
Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions
The paper delivers a two-level hierarchical classification of edge case detection methods in automated driving, covering AV modules and methodologies, plus evaluation metrics and open challenges.
-
Convolutional Reservoir Computing for World Models
RCRC uses untrained random CNNs and reservoir computing plus evolution strategies to reach claimed state-of-the-art scores in reinforcement learning tasks while avoiding data storage and heavy training.
-
Situation Perception: A Necessary Primitive to Artificial Superintelligence
Situation perception is proposed as a necessary primitive for artificial superintelligence, requiring abstract prediction, long-term compressed memory, and objective-guided active learning.
-
DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model
A preview system demonstrates real-time controllable world modeling at 14-15 FPS on RTX 4090 by adapting open video backbones with action pathways for keyboard/mouse control and multimodal features.
-
Building a Scalable, Reproducible, Evaluatable, and Closed-Loop Simulation Environment Foundation for Embodied Intelligence
Presents a four-layer cloud-native framework for scalable, reproducible simulation-based training and evaluation in embodied AI.
-
World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications
The paper delivers a multi-axis taxonomy for world models that maps architectures, training families, reasoning strategies, and domains from early cognitive foundations through systems such as Dreamer, MuZero, and Sora while noting evaluation gaps.