Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Pith reviewed 2026-05-23 00:18 UTC · model grok-4.3
The pith
Disentangled world models transfer semantic knowledge from distracting videos to reinforcement learning via latent distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Disentangled World Models that pretrain an action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is transferred to the world model through latent distillation. For online finetuning we add a disentanglement constraint to the world model; the incorporation of actions and rewards from environment interactions enriches data diversity and strengthens the learned representations.
What carries the argument
Offline-to-online latent distillation that transfers disentanglement capability from a pretrained action-free video prediction model to the RL world model.
If this is right
- RL agents reach target performance with fewer environment interactions on visual benchmarks that contain variations.
- Semantic factors are separated in the world model without requiring task-specific labels during the offline pretraining stage.
- Online adaptation gains from richer data that includes both visual observations and the actions plus rewards collected during interaction.
- The world model inherits interpretable disentanglement properties that persist through the transfer and finetuning stages.
Where Pith is reading between the lines
- The same pretraining-plus-distillation pattern could be applied in any domain where large amounts of unlabeled video exist but direct RL interaction data remain expensive.
- If the distracting videos contain semantics that conflict with the downstream task, the latent distillation step may degrade rather than improve the world model.
- Extending the approach to include multiple distinct video sources during pretraining might further increase the robustness of the transferred representations.
Load-bearing premise
Semantic variations present in the chosen distracting videos are relevant to the target RL tasks and can be disentangled in a form that transfers usefully without negative transfer or loss of task-critical information.
What would settle it
Run the full pipeline on a benchmark where the distracting videos contain only variations unrelated to the RL task and measure whether sample efficiency and final performance remain at or above the level of a standard world model without the offline pretraining step.
Figures
read the original abstract
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, $\textit{i.e.,}$ RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Disentangled World Models (DisWM), a model-based RL framework that pretrains an action-free video prediction model offline on distracting videos using disentanglement regularization to extract semantic knowledge. This pretrained capability is transferred to the world model via latent distillation; during online finetuning a disentanglement constraint is added, with actions and rewards from environment interactions claimed to enrich data diversity and strengthen disentangled representations. The central claim is that this offline-to-online pipeline yields superior performance on various visual RL benchmarks in environments with variations.
Significance. If the empirical claims hold, the work could provide a practical route for transferring semantic factors from offline video data into RL world models, potentially improving sample efficiency and robustness without learning representations entirely from scratch. The approach relies on standard techniques in disentangled representation learning and latent distillation rather than novel theoretical derivations or parameter-free results.
major comments (2)
- [Abstract] Abstract: The manuscript asserts that 'Experimental results validate the superiority of our approach on various benchmarks' yet supplies no quantitative results, ablation studies, baseline comparisons, or implementation details. This absence prevents verification that the proposed pipeline (offline pretraining + latent distillation + online disentanglement constraint) is responsible for any measured gains rather than online data alone.
- [Abstract] Abstract: The central claim requires that semantic variations in the chosen distracting videos align with task-critical factors in the target RL environments and survive latent distillation without negative transfer. The text states that 'the incorporation of actions and rewards … strengthens the disentangled representation learning' but provides no mechanism, analysis, or ablation demonstrating that video-derived factors are the relevant ones or that distillation (versus online interactions) drives the improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the abstract requires strengthening to better substantiate the claims with quantitative highlights and clarifications. We will revise the abstract in the next version to address both points while preserving its concise nature, and we point to the detailed experiments, ablations, and analyses already present in the full manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The manuscript asserts that 'Experimental results validate the superiority of our approach on various benchmarks' yet supplies no quantitative results, ablation studies, baseline comparisons, or implementation details. This absence prevents verification that the proposed pipeline (offline pretraining + latent distillation + online disentanglement constraint) is responsible for any measured gains rather than online data alone.
Authors: We acknowledge that the current abstract does not embed specific numbers or direct references to ablations. In the revision we will add concise quantitative highlights (e.g., relative sample-efficiency gains on the reported visual RL benchmarks) and explicit mentions of the ablation studies and baseline comparisons that appear in Sections 4 and 5. These additions will make clear that the measured improvements arise from the full offline-to-online pipeline rather than online data alone. revision: yes
-
Referee: [Abstract] Abstract: The central claim requires that semantic variations in the chosen distracting videos align with task-critical factors in the target RL environments and survive latent distillation without negative transfer. The text states that 'the incorporation of actions and rewards … strengthens the disentangled representation learning' but provides no mechanism, analysis, or ablation demonstrating that video-derived factors are the relevant ones or that distillation (versus online interactions) drives the improvement.
Authors: The manuscript already contains the requested mechanism (latent distillation of the pretrained disentangled factors followed by an online disentanglement constraint) and supporting ablations (Section 5.3) that isolate the contribution of actions/rewards and show that distillation improves over purely online training without negative transfer. To satisfy the abstract-level concern we will insert a short clause referencing these results and noting that the chosen video variations were selected to match task-critical factors. This keeps the abstract concise while directing readers to the evidence. revision: yes
Circularity Check
No circularity: empirical pipeline with independent experimental validation
full rationale
The paper presents a standard offline-to-online transfer pipeline: pretrain an action-free video model with disentanglement regularization on distracting videos, distill latents to a world model, then finetune online with added actions/rewards and a disentanglement constraint. No derivation, equation, or claim reduces a reported prediction or result to a fitted quantity by construction, nor invokes a self-citation chain or uniqueness theorem to force the architecture. The central claims rest on benchmark comparisons rather than self-referential definitions, making the work self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Disentanglement regularization applied to action-free video prediction extracts semantic factors relevant to downstream RL tasks
- domain assumption Latent distillation transfers disentanglement capability without substantial information loss
invented entities (1)
-
Disentangled World Models (DisWM)
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
Reference graph
Works this paper leans on
-
[1]
Diffusion for world modeling: Visual details matter in atari
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kan- ervisto, Amos Storkey, Tim Pearce, and Franc ¸ois Fleuret. Diffusion for world modeling: Visual details matter in atari. In NeurIPS, 2024. 8
work page 2024
-
[2]
Repre- sentation learning: A review and new perspectives
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Repre- sentation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013. 2
work page 2013
-
[3]
Environment agnostic representation for visual rein- forcement learning
Hyesong Choi, Hunsang Lee, Seongwon Jeong, and Dongbo Min. Environment agnostic representation for visual rein- forcement learning. In ICCV, pages 263–273, 2023. 8
work page 2023
-
[4]
Local-guided global: Paired similarity representation for visual reinforcement learning
Hyesong Choi, Hunsang Lee, Wonil Song, Sangryul Jeon, Kwanghoon Sohn, and Dongbo Min. Local-guided global: Paired similarity representation for visual reinforcement learning. In CVPR, pages 15072–15082, 2023. 8
work page 2023
-
[5]
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Djork-Arn ´e Clevert, Thomas Unterthiner, and Sepp Hochre- iter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015. 1
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[6]
Conditional mutual in- formation for disentangled representations in reinforcement learning
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah Hanna, and Stefano Albrecht. Conditional mutual in- formation for disentangled representations in reinforcement learning. In NeurIPS, 2023. 1, 2
work page 2023
-
[7]
Temporal disen- tanglement of representations for improved generalisation in reinforcement learning
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck, Josiah P Hanna, and Stefano V Albrecht. Temporal disen- tanglement of representations for improved generalisation in reinforcement learning. In ICLR, 2023. 1, 2, 5, 6
work page 2023
-
[8]
Learning latent dynamics for planning from pixels
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Ville- gas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, 2019. 8
work page 2019
-
[9]
Dream to control: Learning behaviors by la- tent imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Moham- mad Norouzi. Dream to control: Learning behaviors by la- tent imagination. In ICLR, 2020
work page 2020
-
[10]
Mastering atari with discrete world models
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. In ICLR, 2021. 4, 6, 8, 1
work page 2021
-
[11]
Deep hierarchical planning from pixels
Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. arXiv preprint arXiv:2206.04114, 2022. 1
-
[12]
Mastering diverse domains through world models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. Nature, 2025. 8, 1
work page 2025
-
[13]
Td-mpc2: Scalable, robust world models for continuous control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. In ICLR, 2024. 1
work page 2024
-
[14]
beta-vae: Learning basic visual con- cepts with a constrained variational framework
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual con- cepts with a constrained variational framework. In ICLR,
-
[15]
Darla: Improv- ing zero-shot transfer in reinforcement learning
Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improv- ing zero-shot transfer in reinforcement learning. In ICML,
-
[16]
Leveraging separated world model for exploration in visually distracted environments
Kaichen Huang, Shenghua Wan, Minghao Shao, Hai-Hang Sun, Le Gan, Shuai Feng, and De-Chuan Zhan. Leveraging separated world model for exploration in visually distracted environments. NeurIPS, 2024. 8
work page 2024
-
[17]
Reinforcement learning with augmented data
Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. In NeurIPS, 2020. 5
work page 2020
-
[18]
Curl: Contrastive unsupervised representations for reinforcement learning
Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In ICML, pages 5639–5650, 2020. 1, 6, 8
work page 2020
-
[19]
Generalizing consistency policy to visual rl with prior- itized proximal experience regularization
Haoran Li, Zhennan Jiang, YUHUI CHEN, and Dongbin Zhao. Generalizing consistency policy to visual rl with prior- itized proximal experience regularization. In NeurIPS, 2024. 8
work page 2024
-
[20]
Open-world reinforcement learning over long short-term imagination
Jiajian Li, Qi Wang, Yunbo Wang, Xin Jin, Yang Li, Wen- jun Zeng, and Xiaokang Yang. Open-world reinforcement learning over long short-term imagination. In ICLR, 2025. 1, 8
work page 2025
-
[21]
Learning to model the world with language
Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language. In ICML, 2024. 8
work page 2024
-
[22]
Struc- tured state space models for in-context reinforcement learn- ing
Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, and Feryal Behbahani. Struc- tured state space models for in-context reinforcement learn- ing. NeurIPS, 2023. 8
work page 2023
-
[23]
Har- monydream: Task harmonization inside world models
Haoyu Ma, Jialong Wu, Ningya Feng, Chenjun Xiao, Dong Li, Jianye Hao, Jianmin Wang, and Mingsheng Long. Har- monydream: Task harmonization inside world models. In ICML, 2024. 8
work page 2024
-
[24]
Vip: Towards universal visual reward and representation via value-implicit pre-training
Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Os- bert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In ICLR, 2023. 8
work page 2023
-
[25]
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(Nov):2579–2605, 2008. 2
work page 2008
-
[26]
Choreographer: Learning and adapting skills in imagination
Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, and Sai Rajeswar. Choreographer: Learning and adapting skills in imagination. In ICLR, 2023. 8
work page 2023
-
[27]
Genrl: Multimodal- foundation world models for generalization in embodied agents
Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron C Courville, and Sai Rajeswar Mudumba. Genrl: Multimodal- foundation world models for generalization in embodied agents. NeurIPS, 2024. 8
work page 2024
-
[28]
Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models
Minting Pan, Xiangming Zhu, Yunbo Wang, and Xiaokang Yang. Iso-dream: Isolating and leveraging noncontrollable visual dynamics in world models. In NeurIPS, 2022. 1, 8
work page 2022
-
[29]
Reinforcement learning with action-free pre- training from videos
Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre- training from videos. In ICML, 2022. 4, 5, 6, 8, 1
work page 2022
-
[30]
Multi-view masked world models for visual robotic manipulation
Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jin- woo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. In ICML, 2023. 8
work page 2023
-
[31]
A simple framework for generalization in visual rl under dynamic scene perturbations
Wonil Song, Hyesong Choi, Kwanghoon Sohn, and Dongbo Min. A simple framework for generalization in visual rl under dynamic scene perturbations. NeurIPS, 37:121790– 121826, 2024. 8
work page 2024
-
[32]
Decoupling representation learning from reinforce- ment learning
Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforce- ment learning. In ICML, pages 9870–9879, 2021. 8
work page 2021
-
[33]
Transfer rl across observation feature spaces via model-based regularization
Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew Cohen, and Furong Huang. Transfer rl across observation feature spaces via model-based regularization. In ICLR, 2022. 8
work page 2022
-
[34]
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Ab- dolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018. 4
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[35]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012. 4
work page 2012
-
[36]
Making offline rl online: Collab- orative world models for offline visual reinforcement learn- ing
Qi Wang, Junming Yang, Yunbo Wang, Xin Jin, Wenjun Zeng, and Xiaokang Yang. Making offline rl online: Collab- orative world models for offline visual reinforcement learn- ing. In NeurIPS, 2024. 8
work page 2024
-
[37]
Unsupervised visual attention and invariance for reinforcement learning
Xudong Wang, Long Lian, and Stella X Yu. Unsupervised visual attention and invariance for reinforcement learning. In CVPR, pages 6677–6687, 2021. 4, 8, 1
work page 2021
-
[38]
Dis- entangled representation learning
Xin Wang, Hong Chen, Zihao Wu, Wenwu Zhu, et al. Dis- entangled representation learning. TPAMI, 2024. 2
work page 2024
-
[39]
Pre-training contextualized world models with in-the-wild videos for reinforcement learning
Jialong Wu, Haoyu Ma, Chaoyi Deng, and Mingsheng Long. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. NeurIPS, 2023. 8, 1
work page 2023
-
[40]
Navinerf: Nerf-based 3d representation disentanglement by latent semantic naviga- tion
Baao Xie, Bohan Li, Zequn Zhang, Junting Dong, Xin Jin, Jingyu Yang, and Wenjun Zeng. Navinerf: Nerf-based 3d representation disentanglement by latent semantic naviga- tion. In ICCV, 2023. 2
work page 2023
-
[41]
Graph-based unsupervised disentan- gled representation learning via multimodal large language models
Baao Xie, Qiuyu Chen, Yunnan Wang, Zequn Zhang, Xin Jin, and Wenjun Zeng. Graph-based unsupervised disentan- gled representation learning via multimodal large language models. In NeurIPS, 2024. 2
work page 2024
-
[42]
Mastering visual continuous control: Improved data- augmented reinforcement learning
Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data- augmented reinforcement learning. In ICLR, 2022. 8
work page 2022
-
[43]
Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta- world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, 2019. 4
work page 2019
-
[44]
Learning invariant representations for reinforcement learning without reconstruction
Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In ICLR,
-
[45]
Prelar: World model pre-training with learnable action rep- resentation
Lixuan Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Prelar: World model pre-training with learnable action rep- resentation. In ECCV, 2024. 1, 8
work page 2024
-
[46]
Storm: Efficient stochastic transformer based world models for reinforcement learning
Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, and Gao Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. In NeurIPS, 2023. 8
work page 2023
-
[47]
Taco: Temporal latent action-driven contrastive loss for visual re- inforcement learning
Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daum´e III, and Furong Huang. Taco: Temporal latent action-driven contrastive loss for visual re- inforcement learning. In NeurIPS, 2023. 8
work page 2023
-
[48]
Transfer learning in deep reinforcement learning: A survey
Zhuangdi Zhu, Kaixiang Lin, Anil K Jain, and Jiayu Zhou. Transfer learning in deep reinforcement learning: A survey. TPAMI, 45(11):13344–13362, 2023. 8 Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning Supplementary Material A. Compared Baselines We compare DisWM with strong visual RL age...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.