Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

Baao Xie; Liaomo Zheng; Qi Wang; Shiyu Wang; Wenjun Zeng; Xiaokang Yang; Xin Jin; Yunbo Wang; Zhipeng Zhang

arxiv: 2503.08751 · v3 · submitted 2025-03-11 · 💻 cs.CV · cs.LG

Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

Qi Wang , Zhipeng Zhang , Baao Xie , Xin Jin , Yunbo Wang , Shiyu Wang , Liaomo Zheng , Xiaokang Yang

show 1 more author

Wenjun Zeng

This is my paper

Pith reviewed 2026-05-23 00:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords disentangled representationsworld modelsreinforcement learningvideo predictionlatent distillationsemantic knowledge transfervisual variationsmodel-based RL

0 comments

The pith

Disentangled world models transfer semantic knowledge from distracting videos to reinforcement learning via latent distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that visual RL agents facing environment variations can achieve higher sample efficiency by extracting and transferring semantic knowledge from unlabeled distracting videos rather than learning representations from scratch. It pretrains an action-free video prediction model offline with disentanglement regularization, then distills the resulting capability into the world model. During online adaptation the model receives an additional disentanglement constraint while actions and rewards from environment interactions increase data diversity and strengthen the representations.

Core claim

We introduce Disentangled World Models that pretrain an action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is transferred to the world model through latent distillation. For online finetuning we add a disentanglement constraint to the world model; the incorporation of actions and rewards from environment interactions enriches data diversity and strengthens the learned representations.

What carries the argument

Offline-to-online latent distillation that transfers disentanglement capability from a pretrained action-free video prediction model to the RL world model.

If this is right

RL agents reach target performance with fewer environment interactions on visual benchmarks that contain variations.
Semantic factors are separated in the world model without requiring task-specific labels during the offline pretraining stage.
Online adaptation gains from richer data that includes both visual observations and the actions plus rewards collected during interaction.
The world model inherits interpretable disentanglement properties that persist through the transfer and finetuning stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pretraining-plus-distillation pattern could be applied in any domain where large amounts of unlabeled video exist but direct RL interaction data remain expensive.
If the distracting videos contain semantics that conflict with the downstream task, the latent distillation step may degrade rather than improve the world model.
Extending the approach to include multiple distinct video sources during pretraining might further increase the robustness of the transferred representations.

Load-bearing premise

Semantic variations present in the chosen distracting videos are relevant to the target RL tasks and can be disentangled in a form that transfers usefully without negative transfer or loss of task-critical information.

What would settle it

Run the full pipeline on a benchmark where the distracting videos contain only variations unrelated to the RL task and measure whether sample efficiency and final performance remain at or above the level of a standard world model without the offline pretraining step.

Figures

Figures reproduced from arXiv: 2503.08751 by Baao Xie, Liaomo Zheng, Qi Wang, Shiyu Wang, Wenjun Zeng, Xiaokang Yang, Xin Jin, Yunbo Wang, Zhipeng Zhang.

**Figure 2.** Figure 2: Architecture of Disentangled World Models. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Example image observations of our modified DMC and MuJoCo [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of DisWM against visual RL baselines, including [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of traversals of β-VAE during the pretraining phase. Object Background color Robot arm color [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Fine-grained disentanglement results on MuJoCo [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: These figures illustrate the ablation studies and sensitivity analyses of DisWM on DMC [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Performance of DisWM on DMC Cartpole Swingup with different video datasets. 5. Related Work Visual MBRL. Visual RL learns control policies from raw pixels, which has achieved remarkable performance in various tasks [3, 4, 31, 37], while prior RL studies focus on learning policies from low-dimensional states. Extant approaches can be divided into two main directions: model-free RL [18, 19, 24, 32, 42, 44,… view at source ↗

read the original abstract

Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, $\textit{i.e.,}$ RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches an offline video pretraining plus latent distillation pipeline for disentangled world models in visual RL, but the abstract contains no results, ablations, or numbers to check whether it works.

read the letter

The central claim is that pretraining an action-free video model on distracting videos with disentanglement regularization, then distilling the latents into an action-conditioned world model and keeping the disentanglement constraint during online RL, lets the agent pick up useful semantic factors and improves sample efficiency. This specific offline-to-online distillation step with continued constraints during adaptation is the concrete new piece; it sits on top of prior disentangled representation work and model-based RL rather than replacing them. The framing of the problem is clear and the pipeline is described in enough detail to understand what they are trying to do. Credit for trying to make external video data useful instead of starting every RL run from scratch. The soft spots are right where the stress-test note points. The abstract asserts benchmark superiority and says actions and rewards strengthen the disentanglement, yet supplies zero quantitative results, no ablation tables, no baseline numbers, and no implementation details. Without those, there is no way to tell whether the video-derived factors actually align with task-critical variations or whether the distillation step adds anything beyond what online data alone would give. The assumption that the chosen distracting videos contain relevant, transferable semantics without negative transfer is stated but not evidenced here. The approach uses standard techniques for disentanglement and world models, so the math looks ordinary rather than circular. This is aimed at people working on visual model-based RL who already follow disentangled representations. A reader could extract the framework idea, but the lack of any supporting data means it is not yet ready for serious use or citation. I would not bring it to a reading group on the current abstract alone. I would not cite it. It does not yet deserve peer review; the full manuscript would need reproducible experiments and controls before an editor should send it out.

Referee Report

2 major / 0 minor

Summary. The paper introduces Disentangled World Models (DisWM), a model-based RL framework that pretrains an action-free video prediction model offline on distracting videos using disentanglement regularization to extract semantic knowledge. This pretrained capability is transferred to the world model via latent distillation; during online finetuning a disentanglement constraint is added, with actions and rewards from environment interactions claimed to enrich data diversity and strengthen disentangled representations. The central claim is that this offline-to-online pipeline yields superior performance on various visual RL benchmarks in environments with variations.

Significance. If the empirical claims hold, the work could provide a practical route for transferring semantic factors from offline video data into RL world models, potentially improving sample efficiency and robustness without learning representations entirely from scratch. The approach relies on standard techniques in disentangled representation learning and latent distillation rather than novel theoretical derivations or parameter-free results.

major comments (2)

[Abstract] Abstract: The manuscript asserts that 'Experimental results validate the superiority of our approach on various benchmarks' yet supplies no quantitative results, ablation studies, baseline comparisons, or implementation details. This absence prevents verification that the proposed pipeline (offline pretraining + latent distillation + online disentanglement constraint) is responsible for any measured gains rather than online data alone.
[Abstract] Abstract: The central claim requires that semantic variations in the chosen distracting videos align with task-critical factors in the target RL environments and survive latent distillation without negative transfer. The text states that 'the incorporation of actions and rewards … strengthens the disentangled representation learning' but provides no mechanism, analysis, or ablation demonstrating that video-derived factors are the relevant ones or that distillation (versus online interactions) drives the improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the abstract requires strengthening to better substantiate the claims with quantitative highlights and clarifications. We will revise the abstract in the next version to address both points while preserving its concise nature, and we point to the detailed experiments, ablations, and analyses already present in the full manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The manuscript asserts that 'Experimental results validate the superiority of our approach on various benchmarks' yet supplies no quantitative results, ablation studies, baseline comparisons, or implementation details. This absence prevents verification that the proposed pipeline (offline pretraining + latent distillation + online disentanglement constraint) is responsible for any measured gains rather than online data alone.

Authors: We acknowledge that the current abstract does not embed specific numbers or direct references to ablations. In the revision we will add concise quantitative highlights (e.g., relative sample-efficiency gains on the reported visual RL benchmarks) and explicit mentions of the ablation studies and baseline comparisons that appear in Sections 4 and 5. These additions will make clear that the measured improvements arise from the full offline-to-online pipeline rather than online data alone. revision: yes
Referee: [Abstract] Abstract: The central claim requires that semantic variations in the chosen distracting videos align with task-critical factors in the target RL environments and survive latent distillation without negative transfer. The text states that 'the incorporation of actions and rewards … strengthens the disentangled representation learning' but provides no mechanism, analysis, or ablation demonstrating that video-derived factors are the relevant ones or that distillation (versus online interactions) drives the improvement.

Authors: The manuscript already contains the requested mechanism (latent distillation of the pretrained disentangled factors followed by an online disentanglement constraint) and supporting ablations (Section 5.3) that isolate the contribution of actions/rewards and show that distillation improves over purely online training without negative transfer. To satisfy the abstract-level concern we will insert a short clause referencing these results and noting that the chosen video variations were selected to match task-critical factors. This keeps the abstract concise while directing readers to the evidence. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent experimental validation

full rationale

The paper presents a standard offline-to-online transfer pipeline: pretrain an action-free video model with disentanglement regularization on distracting videos, distill latents to a world model, then finetune online with added actions/rewards and a disentanglement constraint. No derivation, equation, or claim reduces a reported prediction or result to a fitted quantity by construction, nor invokes a self-citation chain or uniqueness theorem to force the architecture. The central claims rest on benchmark comparisons rather than self-referential definitions, making the work self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Abstract provides insufficient detail to enumerate concrete free parameters or background axioms; the central claim rests on the domain assumption that disentanglement regularization extracts transferable semantics and that latent distillation preserves them.

axioms (2)

domain assumption Disentanglement regularization applied to action-free video prediction extracts semantic factors relevant to downstream RL tasks
Invoked to justify offline pretraining step
domain assumption Latent distillation transfers disentanglement capability without substantial information loss
Central to the knowledge-transfer mechanism

invented entities (1)

Disentangled World Models (DisWM) no independent evidence
purpose: End-to-end framework combining offline video pretraining, latent distillation, and online disentanglement constraint
New named model introduced to realize the described pipeline

pith-pipeline@v0.9.0 · 5772 in / 1353 out tokens · 85232 ms · 2026-05-23T00:18:53.078035+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kan- ervisto, Amos Storkey, Tim Pearce, and Franc ¸ois Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024. 8

work page 2024
[2]

Repre- sentation learning: A review and new perspectives

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Repre- sentation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013. 2

work page 2013
[3]

Environment agnostic representation for visual rein- forcement learning

Hyesong Choi, Hunsang Lee, Seongwon Jeong, and Dongbo Min. Environment agnostic representation for visual rein- forcement learning. In ICCV, pages 263–273, 2023. 8

work page 2023
[4]

Local-guided global: Paired similarity representation for visual reinforcement learning

Hyesong Choi, Hunsang Lee, Wonil Song, Sangryul Jeon, Kwanghoon Sohn, and Dongbo Min. Local-guided global: Paired similarity representation for visual reinforcement learning. In CVPR, pages 15072–15082, 2023. 8

work page 2023
[5]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arn ´e Clevert, Thomas Unterthiner, and Sepp Hochre- iter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015. 1

work page internal anchor Pith review Pith/arXiv arXiv 2015
[6]

Conditional mutual in- formation for disentangled representations in reinforcement learning

Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, and Stefano Albrecht. Conditional mutual in- formation for disentangled representations in reinforcement learning. In NeurIPS, 2023. 1, 2

work page 2023
[7]

Temporal disen- tanglement of representations for improved generalisation in reinforcement learning

Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah P Hanna, and Stefano V Albrecht. Temporal disen- tanglement of representations for improved generalisation in reinforcement learning. In ICLR, 2023. 1, 2, 5, 6

work page 2023
[8]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Ville- gas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019. 8

work page 2019
[9]

Dream to control: Learning behaviors by la- tent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Moham- mad Norouzi. Dream to control: Learning behaviors by la- tent imagination. In ICLR, 2020

work page 2020
[10]

Mastering atari with discrete world models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 4, 6, 8, 1

work page 2021
[11]

Deep hierarchical planning from pixels

Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. arXiv preprint arXiv:2206.04114, 2022. 1

work page arXiv 2022
[12]

Mastering diverse domains through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. Nature, 2025. 8, 1

work page 2025
[13]

Td-mpc2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In ICLR, 2024. 1

work page 2024
[14]

beta-vae: Learning basic visual con- cepts with a constrained variational framework

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual con- cepts with a constrained variational framework. In ICLR,

work page
[15]

Darla: Improv- ing zero-shot transfer in reinforcement learning

Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improv- ing zero-shot transfer in reinforcement learning. In ICML,

work page
[16]

Leveraging separated world model for exploration in visually distracted environments

Kaichen Huang, Shenghua Wan, Minghao Shao, Hai-Hang Sun, Le Gan, Shuai Feng, and De-Chuan Zhan. Leveraging separated world model for exploration in visually distracted environments. NeurIPS, 2024. 8

work page 2024
[17]

Reinforcement learning with augmented data

Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In NeurIPS, 2020. 5

work page 2020
[18]

Curl: Contrastive unsupervised representations for reinforcement learning

Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In ICML, pages 5639–5650, 2020. 1, 6, 8

work page 2020
[19]

Generalizing consistency policy to visual rl with prior- itized proximal experience regularization

Haoran Li, Zhennan Jiang, YUHUI CHEN, and Dongbin Zhao. Generalizing consistency policy to visual rl with prior- itized proximal experience regularization. In NeurIPS, 2024. 8

work page 2024
[20]

Open-world reinforcement learning over long short-term imagination

Jiajian Li, Qi Wang, Yunbo Wang, Xin Jin, Yang Li, Wen- jun Zeng, and Xiaokang Yang. Open-world reinforcement learning over long short-term imagination. In ICLR, 2025. 1, 8

work page 2025
[21]

Learning to model the world with language

Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language. In ICML, 2024. 8

work page 2024
[22]

Struc- tured state space models for in-context reinforcement learn- ing

Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, and Feryal Behbahani. Struc- tured state space models for in-context reinforcement learn- ing. NeurIPS, 2023. 8

work page 2023
[23]

Har- monydream: Task harmonization inside world models

Haoyu Ma, Jialong Wu, Ningya Feng, Chenjun Xiao, Dong Li, Jianye Hao, Jianmin Wang, and Mingsheng Long. Har- monydream: Task harmonization inside world models. In ICML, 2024. 8

work page 2024
[24]

Vip: Towards universal visual reward and representation via value-implicit pre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Os- bert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In ICLR, 2023. 8

work page 2023
[25]

Visualizing data using t-sne

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(Nov):2579–2605, 2008. 2

work page 2008
[26]

Choreographer: Learning and adapting skills in imagination

Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, and Sai Rajeswar. Choreographer: Learning and adapting skills in imagination. In ICLR, 2023. 8

work page 2023
[27]

Genrl: Multimodal- foundation world models for generalization in embodied agents

Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron C Courville, and Sai Rajeswar Mudumba. Genrl: Multimodal- foundation world models for generalization in embodied agents. NeurIPS, 2024. 8

work page 2024
[28]

Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models

Minting Pan, Xiangming Zhu, Yunbo Wang, and Xiaokang Yang. Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models. In NeurIPS, 2022. 1, 8

work page 2022
[29]

Reinforcement learning with action-free pre- training from videos

Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre- training from videos. In ICML, 2022. 4, 5, 6, 8, 1

work page 2022
[30]

Multi-view masked world models for visual robotic manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jin- woo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. In ICML, 2023. 8

work page 2023
[31]

A simple framework for generalization in visual rl under dynamic scene perturbations

Wonil Song, Hyesong Choi, Kwanghoon Sohn, and Dongbo Min. A simple framework for generalization in visual rl under dynamic scene perturbations. NeurIPS, 37:121790– 121826, 2024. 8

work page 2024
[32]

Decoupling representation learning from reinforce- ment learning

Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforce- ment learning. In ICML, pages 9870–9879, 2021. 8

work page 2021
[33]

Transfer rl across observation feature spaces via model-based regularization

Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew Cohen, and Furong Huang. Transfer rl across observation feature spaces via model-based regularization. In ICLR, 2022. 8

work page 2022
[34]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Ab- dolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018. 4

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012. 4

work page 2012
[36]

Making offline rl online: Collab- orative world models for offline visual reinforcement learn- ing

Qi Wang, Junming Yang, Yunbo Wang, Xin Jin, Wenjun Zeng, and Xiaokang Yang. Making offline rl online: Collab- orative world models for offline visual reinforcement learn- ing. In NeurIPS, 2024. 8

work page 2024
[37]

Unsupervised visual attention and invariance for reinforcement learning

Xudong Wang, Long Lian, and Stella X Yu. Unsupervised visual attention and invariance for reinforcement learning. In CVPR, pages 6677–6687, 2021. 4, 8, 1

work page 2021
[38]

Dis- entangled representation learning

Xin Wang, Hong Chen, Zihao Wu, Wenwu Zhu, et al. Dis- entangled representation learning. TPAMI, 2024. 2

work page 2024
[39]

Pre-training contextualized world models with in-the-wild videos for reinforcement learning

Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. NeurIPS, 2023. 8, 1

work page 2023
[40]

Navinerf: Nerf-based 3d representation disentanglement by latent semantic naviga- tion

Baao Xie, Bohan Li, Zequn Zhang, Junting Dong, Xin Jin, Jingyu Yang, and Wenjun Zeng. Navinerf: Nerf-based 3d representation disentanglement by latent semantic naviga- tion. In ICCV, 2023. 2

work page 2023
[41]

Graph-based unsupervised disentan- gled representation learning via multimodal large language models

Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, and Wenjun Zeng. Graph-based unsupervised disentan- gled representation learning via multimodal large language models. In NeurIPS, 2024. 2

work page 2024
[42]

Mastering visual continuous control: Improved data- augmented reinforcement learning

Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data- augmented reinforcement learning. In ICLR, 2022. 8

work page 2022
[43]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2019. 4

work page 2019
[44]

Learning invariant representations for reinforcement learning without reconstruction

Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In ICLR,

work page
[45]

Prelar: World model pre-training with learnable action rep- resentation

Lixuan Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Prelar: World model pre-training with learnable action rep- resentation. In ECCV, 2024. 1, 8

work page 2024
[46]

Storm: Efficient stochastic transformer based world models for reinforcement learning

Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. In NeurIPS, 2023. 8

work page 2023
[47]

Taco: Temporal latent action-driven contrastive loss for visual re- inforcement learning

Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daum´e III, and Furong Huang. Taco: Temporal latent action-driven contrastive loss for visual re- inforcement learning. In NeurIPS, 2023. 8

work page 2023
[48]

Transfer learning in deep reinforcement learning: A survey

Zhuangdi Zhu, Kaixiang Lin, Anil K Jain, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. TPAMI, 45(11):13344–13362, 2023. 8 Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning Supplementary Material A. Compared Baselines We compare DisWM with strong visual RL age...

work page 2023

[1] [1]

Diffusion for world modeling: Visual details matter in atari

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kan- ervisto, Amos Storkey, Tim Pearce, and Franc ¸ois Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024. 8

work page 2024

[2] [2]

Repre- sentation learning: A review and new perspectives

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Repre- sentation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013. 2

work page 2013

[3] [3]

Environment agnostic representation for visual rein- forcement learning

Hyesong Choi, Hunsang Lee, Seongwon Jeong, and Dongbo Min. Environment agnostic representation for visual rein- forcement learning. In ICCV, pages 263–273, 2023. 8

work page 2023

[4] [4]

Local-guided global: Paired similarity representation for visual reinforcement learning

Hyesong Choi, Hunsang Lee, Wonil Song, Sangryul Jeon, Kwanghoon Sohn, and Dongbo Min. Local-guided global: Paired similarity representation for visual reinforcement learning. In CVPR, pages 15072–15082, 2023. 8

work page 2023

[5] [5]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arn ´e Clevert, Thomas Unterthiner, and Sepp Hochre- iter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015. 1

work page internal anchor Pith review Pith/arXiv arXiv 2015

[6] [6]

Conditional mutual in- formation for disentangled representations in reinforcement learning

Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, and Stefano Albrecht. Conditional mutual in- formation for disentangled representations in reinforcement learning. In NeurIPS, 2023. 1, 2

work page 2023

[7] [7]

Temporal disen- tanglement of representations for improved generalisation in reinforcement learning

Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah P Hanna, and Stefano V Albrecht. Temporal disen- tanglement of representations for improved generalisation in reinforcement learning. In ICLR, 2023. 1, 2, 5, 6

work page 2023

[8] [8]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Ville- gas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019. 8

work page 2019

[9] [9]

Dream to control: Learning behaviors by la- tent imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Moham- mad Norouzi. Dream to control: Learning behaviors by la- tent imagination. In ICLR, 2020

work page 2020

[10] [10]

Mastering atari with discrete world models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 4, 6, 8, 1

work page 2021

[11] [11]

Deep hierarchical planning from pixels

Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. arXiv preprint arXiv:2206.04114, 2022. 1

work page arXiv 2022

[12] [12]

Mastering diverse domains through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. Nature, 2025. 8, 1

work page 2025

[13] [13]

Td-mpc2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In ICLR, 2024. 1

work page 2024

[14] [14]

beta-vae: Learning basic visual con- cepts with a constrained variational framework

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual con- cepts with a constrained variational framework. In ICLR,

work page

[15] [15]

Darla: Improv- ing zero-shot transfer in reinforcement learning

Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improv- ing zero-shot transfer in reinforcement learning. In ICML,

work page

[16] [16]

Leveraging separated world model for exploration in visually distracted environments

Kaichen Huang, Shenghua Wan, Minghao Shao, Hai-Hang Sun, Le Gan, Shuai Feng, and De-Chuan Zhan. Leveraging separated world model for exploration in visually distracted environments. NeurIPS, 2024. 8

work page 2024

[17] [17]

Reinforcement learning with augmented data

Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In NeurIPS, 2020. 5

work page 2020

[18] [18]

Curl: Contrastive unsupervised representations for reinforcement learning

Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In ICML, pages 5639–5650, 2020. 1, 6, 8

work page 2020

[19] [19]

Generalizing consistency policy to visual rl with prior- itized proximal experience regularization

Haoran Li, Zhennan Jiang, YUHUI CHEN, and Dongbin Zhao. Generalizing consistency policy to visual rl with prior- itized proximal experience regularization. In NeurIPS, 2024. 8

work page 2024

[20] [20]

Open-world reinforcement learning over long short-term imagination

Jiajian Li, Qi Wang, Yunbo Wang, Xin Jin, Yang Li, Wen- jun Zeng, and Xiaokang Yang. Open-world reinforcement learning over long short-term imagination. In ICLR, 2025. 1, 8

work page 2025

[21] [21]

Learning to model the world with language

Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language. In ICML, 2024. 8

work page 2024

[22] [22]

Struc- tured state space models for in-context reinforcement learn- ing

Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, and Feryal Behbahani. Struc- tured state space models for in-context reinforcement learn- ing. NeurIPS, 2023. 8

work page 2023

[23] [23]

Har- monydream: Task harmonization inside world models

Haoyu Ma, Jialong Wu, Ningya Feng, Chenjun Xiao, Dong Li, Jianye Hao, Jianmin Wang, and Mingsheng Long. Har- monydream: Task harmonization inside world models. In ICML, 2024. 8

work page 2024

[24] [24]

Vip: Towards universal visual reward and representation via value-implicit pre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Os- bert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In ICLR, 2023. 8

work page 2023

[25] [25]

Visualizing data using t-sne

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(Nov):2579–2605, 2008. 2

work page 2008

[26] [26]

Choreographer: Learning and adapting skills in imagination

Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, and Sai Rajeswar. Choreographer: Learning and adapting skills in imagination. In ICLR, 2023. 8

work page 2023

[27] [27]

Genrl: Multimodal- foundation world models for generalization in embodied agents

Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron C Courville, and Sai Rajeswar Mudumba. Genrl: Multimodal- foundation world models for generalization in embodied agents. NeurIPS, 2024. 8

work page 2024

[28] [28]

Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models

Minting Pan, Xiangming Zhu, Yunbo Wang, and Xiaokang Yang. Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models. In NeurIPS, 2022. 1, 8

work page 2022

[29] [29]

Reinforcement learning with action-free pre- training from videos

Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre- training from videos. In ICML, 2022. 4, 5, 6, 8, 1

work page 2022

[30] [30]

Multi-view masked world models for visual robotic manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jin- woo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. In ICML, 2023. 8

work page 2023

[31] [31]

A simple framework for generalization in visual rl under dynamic scene perturbations

Wonil Song, Hyesong Choi, Kwanghoon Sohn, and Dongbo Min. A simple framework for generalization in visual rl under dynamic scene perturbations. NeurIPS, 37:121790– 121826, 2024. 8

work page 2024

[32] [32]

Decoupling representation learning from reinforce- ment learning

Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforce- ment learning. In ICML, pages 9870–9879, 2021. 8

work page 2021

[33] [33]

Transfer rl across observation feature spaces via model-based regularization

Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew Cohen, and Furong Huang. Transfer rl across observation feature spaces via model-based regularization. In ICLR, 2022. 8

work page 2022

[34] [34]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Ab- dolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018. 4

work page internal anchor Pith review Pith/arXiv arXiv 2018

[35] [35]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012. 4

work page 2012

[36] [36]

Making offline rl online: Collab- orative world models for offline visual reinforcement learn- ing

Qi Wang, Junming Yang, Yunbo Wang, Xin Jin, Wenjun Zeng, and Xiaokang Yang. Making offline rl online: Collab- orative world models for offline visual reinforcement learn- ing. In NeurIPS, 2024. 8

work page 2024

[37] [37]

Unsupervised visual attention and invariance for reinforcement learning

Xudong Wang, Long Lian, and Stella X Yu. Unsupervised visual attention and invariance for reinforcement learning. In CVPR, pages 6677–6687, 2021. 4, 8, 1

work page 2021

[38] [38]

Dis- entangled representation learning

Xin Wang, Hong Chen, Zihao Wu, Wenwu Zhu, et al. Dis- entangled representation learning. TPAMI, 2024. 2

work page 2024

[39] [39]

Pre-training contextualized world models with in-the-wild videos for reinforcement learning

Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. NeurIPS, 2023. 8, 1

work page 2023

[40] [40]

Navinerf: Nerf-based 3d representation disentanglement by latent semantic naviga- tion

Baao Xie, Bohan Li, Zequn Zhang, Junting Dong, Xin Jin, Jingyu Yang, and Wenjun Zeng. Navinerf: Nerf-based 3d representation disentanglement by latent semantic naviga- tion. In ICCV, 2023. 2

work page 2023

[41] [41]

Graph-based unsupervised disentan- gled representation learning via multimodal large language models

Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, and Wenjun Zeng. Graph-based unsupervised disentan- gled representation learning via multimodal large language models. In NeurIPS, 2024. 2

work page 2024

[42] [42]

Mastering visual continuous control: Improved data- augmented reinforcement learning

Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data- augmented reinforcement learning. In ICLR, 2022. 8

work page 2022

[43] [43]

Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2019. 4

work page 2019

[44] [44]

Learning invariant representations for reinforcement learning without reconstruction

Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In ICLR,

work page

[45] [45]

Prelar: World model pre-training with learnable action rep- resentation

Lixuan Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Prelar: World model pre-training with learnable action rep- resentation. In ECCV, 2024. 1, 8

work page 2024

[46] [46]

Storm: Efficient stochastic transformer based world models for reinforcement learning

Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. In NeurIPS, 2023. 8

work page 2023

[47] [47]

Taco: Temporal latent action-driven contrastive loss for visual re- inforcement learning

Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daum´e III, and Furong Huang. Taco: Temporal latent action-driven contrastive loss for visual re- inforcement learning. In NeurIPS, 2023. 8

work page 2023

[48] [48]

Transfer learning in deep reinforcement learning: A survey

Zhuangdi Zhu, Kaixiang Lin, Anil K Jain, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. TPAMI, 45(11):13344–13362, 2023. 8 Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning Supplementary Material A. Compared Baselines We compare DisWM with strong visual RL age...

work page 2023