World Models

David Ha · 2018 · cs.LG · arXiv 1803.10122

Canonical reference. 88% of citing Pith papers cite this work as background.

180 Pith papers citing it

Background 88% of classified citations

abstract

We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment. An interactive version of this paper is available at https://worldmodels.github.io/

citation-role summary

background 36 · method 3 · other 1

citation-polarity summary

background 35 · use method 3 · unclear 2

authors

David Ha, Jürgen Schmidhuber

representative citing papers

Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation

cs.LG · 2026-06-26 · unverdicted · novelty 8.0

Introduces textual belief states and factorized GRPO to enforce strict latent state mediation in text-based world models, yielding preserved prediction accuracy with large gains in representation quality and rollout performance on TextWorld and ScienceWorld.

From Generalist to Specialist Representation

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

Task structure is identifiable across time steps and task-relevant representations are identifiable within steps in a nonparametric setting under sparsity regularization.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

A Model-Free Universal AI

cs.AI · 2026-02-26 · unverdicted · novelty 8.0

AIQI is the first model-free universal AI agent proven asymptotically ε-optimal in general RL by inducing over distributional Q-functions instead of policies or environments.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

Distilling a Modular Reservoir Through a Genomic Bottleneck

cs.NE · 2026-06-20 · unverdicted · novelty 7.0

Hypernetworks distill modular reservoir connectivity via a genomic bottleneck to generate sparse recurrent networks solving difficult temporal tasks with minimal training and maintained robustness.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

Benchmarking Single-Factor Physical Video-to-Audio Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

UWM-JEPA: Predictive World Models That Imagine in Belief Space

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.

Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

JEPA guidance steers diffusion models toward low-density regions under an implicit density from a world model, producing minority samples with improved fidelity and semantic validity over generator-centric baselines.

SliceWorld: A Predictive and Controllable World-State Model for CT Report Generation

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

SliceWorld introduces a world-state model for CT report generation that uses predictive and factor-aware objectives on axial slice sequences.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

cs.RO · 2026-05-20 · unverdicted · novelty 7.0

Demo-JEPA enables one-shot cross-embodiment imitation by mapping visual demonstrations to shared latent future trajectories that serve as subgoals for the target agent's own forward dynamics planning.

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

cs.AI · 2026-05-16 · unverdicted · novelty 7.0

Alice uses preservation conflicts from failed candidate updates to create class-stratified hypotheses and guide exploration, improving executable world-model learning under prior misalignment.

Learning POMDP World Models from Observations with Language-Model Priors

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

citing papers explorer

Showing 27 of 27 citing papers after filters.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · none · ref 12 · internal anchor

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

cs.RO · 2026-05-20 · unverdicted · none · ref 45 · internal anchor

Demo-JEPA enables one-shot cross-embodiment imitation by mapping visual demonstrations to shared latent future trajectories that serve as subgoals for the target agent's own forward dynamics planning.

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

cs.RO · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.

LLM-Guided Future Hypotheses for Horizon-Aware Exploration in Multi-Step Robot Manipulation

cs.RO · 2026-05-28 · unverdicted · none · ref 6 · internal anchor

FEC conditions policies on LLM-guided short-horizon future videos via a three-stage pipeline, yielding performance gains for BC+RL over no-future baselines on RoboCasa and CALVIN while mismatched futures degrade results.

Network-Efficient World Model Token Streaming

cs.RO · 2026-05-11 · unverdicted · none · ref 1 · internal anchor

An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bitrates for tokenized driving world models.

Latent Geometry Beyond Search: Amortizing Planning in World Models

cs.RO · 2026-05-09 · unverdicted · none · ref 3 · 2 links · internal anchor

A Goal-Conditioned Inverse Dynamics Model amortizes planning in pretrained world model latents, matching or exceeding CEM in seven of eight settings at 100-130x lower per-decision cost.

RAY-TOLD: Ray-Based Latent Dynamics for Dense Dynamic Obstacle Avoidance with TDMPC

cs.RO · 2026-04-30 · unverdicted · none · ref 15 · internal anchor

RAY-TOLD combines ray-based latent dynamics from LiDAR with MPPI control and a learned policy prior via mixture sampling to lower collision rates in high-density dynamic obstacle environments compared to standard MPPI.

Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04-17 · unverdicted · none · ref 60 · internal anchor

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

cs.RO · 2026-04-14 · unverdicted · none · ref 17 · internal anchor

Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

cs.RO · 2026-04-06 · unverdicted · none · ref 18 · internal anchor

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control

cs.RO · 2026-04-03 · unverdicted · none · ref 12 · internal anchor

A behavior-constrained RL framework with receding-horizon credit assignment learns high-performance control policies that stay aligned with expert behavior in race car simulation.

HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model

cs.RO · 2026-02-12 · unverdicted · none · ref 10 · internal anchor

HAIC enables robust humanoid interactions with underactuated objects by predicting their dynamics from proprioceptive history and using a world model for adaptive control.

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

cs.RO · 2025-08-07 · unverdicted · none · ref 13 · internal anchor

Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

cs.RO · 2024-10-08 · unverdicted · none · ref 64 · internal anchor

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

cs.RO · 2026-05-07 · unverdicted · none · ref 1 · internal anchor

CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

cs.RO · 2026-05-06 · unverdicted · none · ref 2 · 2 links · internal anchor

HDFlow pairs a high-level diffusion planner for strategic subgoals with a low-level rectified flow planner for efficient trajectories, claiming superior performance on furniture assembly and other long-horizon robotic benchmarks.

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

cs.RO · 2026-04-22 · unverdicted · none · ref 24 · internal anchor

Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and unpacking.

Bio-Inspired Topological Autonomous Navigation with Active Inference in Robotics

cs.RO · 2025-08-10 · unverdicted · none · ref 25 · internal anchor

An active-inference agent builds real-time topological maps and plans adaptive trajectories for exploration and goal-reaching in robotics without pre-training.

WorldVLA: Towards Autoregressive Action World Model

cs.RO · 2025-06-26 · unverdicted · none · ref 12 · internal anchor

WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

EvolvingAgent: Curriculum Self-evolving Agent with Continual World Model for Long-Horizon Tasks

cs.RO · 2025-02-09 · unverdicted · none · ref 5 · internal anchor

EvolvingAgent autonomously completes long-horizon tasks via a closed-loop planner-controller-reflector system with continual world model updates, reporting 111.74% higher success rates than baselines in Minecraft and human-level Atari performance.

DynoPlan: Combining Motion Planning and Deep Neural Network based Controllers for Safe HRL

cs.RO · 2019-06-24 · unverdicted · none · ref 4 · internal anchor

DynoPlan adds dynamics models and a demonstration-derived heuristic to the options framework so that hierarchical RL can switch between motion planning and DNN controllers via short-horizon model-predictive evaluation.

Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling

cs.RO · 2026-06-25 · unverdicted · none · ref 11 · internal anchor

A cost-aware selective inference framework combines a lightweight multimodal student model and driver-state world modeling to reduce unsafe false negatives in driver monitoring while keeping low latency.

Can Predicted Dynamics Exist in the Physical World?

cs.RO · 2026-05-23 · unverdicted · none · ref 1 · internal anchor

Physical admissibility is defined as a prediction-control interface using kinematic, dynamic, and composed-horizon conditions to reject invalid dynamics proposals, with AUC 0.957 on LeRobot PushT and 87-89% prevention of invalid actions in interventions.

Dyadic Partnership (DP): A Missing Link Towards Full Autonomy in Medical Robotics

cs.RO · 2026-04-13 · unverdicted · none · ref 16 · internal anchor

The paper introduces Dyadic Partnership (DP) as an intermediate paradigm for robot-clinician collaboration that uses foundation models and multi-modal interfaces to enable safer gradual progress toward autonomous medical robotics.

Edge Case Detection in Automated Driving: Methods, Challenges and Future Directions

cs.RO · 2024-10-11 · unverdicted · none · ref 133 · internal anchor

The paper delivers a two-level hierarchical classification of edge case detection methods in automated driving, covering AV modules and methodologies, plus evaluation metrics and open challenges.

3D Generation for Embodied AI and Robotic Simulation: A Survey

cs.RO · 2026-04-29 · unverdicted · none · ref 64 · 3 links · internal anchor

The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and transfer.

Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

cs.RO · 2025-12-29 · unreviewed · ref 9 · internal anchor
