AIQI is the first model-free universal AI agent proven asymptotically ε-optimal in general RL by inducing over distributional Q-functions instead of policies or environments.
hub Canonical reference
Dream to Control: Learning Behaviors by Latent Imagination
Canonical reference. 95% of citing Pith papers cite this work as background.
abstract
Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual contro
- background Langugae-Conditoned MoCoGAN [29], U-Net [30], Latte [ 31], Wan [32], Sora 2 [ 33]. . . Embodied World Model SWIM [34], DreamDojo [ 35], RoboDreamer [36], RoboScape [37]. . . WM for VLA Imitation Learning Ctrl-World [38], RoboScape [37], DREMA [ 39] Reinforcement Learning Dreamer to Control [ 40] DreamerV2 [ 41], Dreamer 4 [ 42], RISE [ 43] DreamerV3 [44], DayDreamer [45], World-Env [46], RoboScape-R [47] WMPO [48], WoVR [49], VLA-RFT [50], RWML [51], MoDem-V2 [52] World-Gymnast [53], RWM-U [54],
co-cited works
representative citing papers
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.
VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
RopeDreamer uses quaternionic kinematic chains in a recurrent state space model with a dual decoder to cut open-loop prediction error by 40.52% over 50 steps on simulated DLO trajectories while preserving physical constraints.
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
MoGaF groups Gaussians by motion in 4D splatting representations to enable stable long-term forecasting of dynamic scenes.
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
A diffusion model trained on DOOM play sessions generates stable real-time interactive game frames at 20 FPS with quality near lossy JPEG.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.
Feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.
Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as world models.
citing papers explorer
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.