DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation
Pith reviewed 2026-07-01 04:52 UTC · model grok-4.3
The pith
Disentangling dynamics learning from visual synthesis in video generation produces faster and higher-quality embodied world models for robotic manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DVG-WM decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, it first generates a sequence of intermediate visual states that preview the physical interaction, then refines them into high-fidelity video. An efficient cascading mechanism links the two stages: flow matching maps the dynamics directly to video latents, and latent degradation regenerates contact-rich details. The reported result is improved video quality with up to 3.97 times acceleration.
What carries the argument
The DVG-WM framework that decomposes world modeling into dynamics learning followed by visual synthesis, with a cascading mechanism that applies flow matching and latent degradation.
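The flow-matching half of the cascading mechanism can be illustrated with a toy sketch. It assumes the common straight-line (rectified-flow style) path, where the regression target for the velocity field is the difference between the two endpoints. The paper's actual path, network, and latent shapes are not given in this summary, so `interpolate`, `target_velocity`, `euler_sample`, and the 8-dimensional vectors below are hypothetical stand-ins, not the authors' implementation.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity regression target on the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins (assumption): x0 plays the role of a dynamics-stage latent,
# x1 the role of the matching high-fidelity video latent.
rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]
x1 = [rng.gauss(0.0, 1.0) for _ in range(8)]


def oracle_velocity(x, t):
    """Exact velocity for this single pair; a real model would be a trained network."""
    return target_velocity(x0, x1)


x_mid = interpolate(x0, x1, 0.5)
x_end = euler_sample(x0, oracle_velocity, num_steps=4)
```

With an exact constant velocity, fixed-step Euler lands on `x1` for any step count, which is one intuition for why straight paths can be sampled in few steps. A learned velocity field is only approximately straight, so real step counts are an empirical choice.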
If this is right
- Inference becomes fast enough for iterative planning loops in robotic manipulation.
- Contact-rich details are retained better than in entangled video generation approaches.
- The model achieves measurable gains on both the LIBERO benchmark and real-world robot platforms.
- The separation allows direct mapping from dynamics to video latents without full low-level temporal reasoning at every frame.
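To make the planning-loop consequence concrete, the arithmetic can be sketched as follows. Only the 3.97 factor comes from the abstract, where it is stated as an upper bound; the 10-second baseline latency and 60-second planning budget are hypothetical numbers chosen for illustration, and the function names are not from the paper.

```python
SPEEDUP = 3.97  # "up to" figure reported in the abstract


def accelerated_latency(baseline_seconds, speedup):
    """Latency of one rollout after applying a multiplicative speedup."""
    return baseline_seconds / speedup


def rollouts_within_budget(budget_seconds, latency_seconds):
    """Number of complete rollouts that fit in a fixed time budget."""
    return int(budget_seconds // latency_seconds)


baseline_latency = 10.0  # hypothetical seconds per rollout for an entangled model
planning_budget = 60.0   # hypothetical seconds available per planning round

fast_latency = accelerated_latency(baseline_latency, SPEEDUP)
baseline_rollouts = rollouts_within_budget(planning_budget, baseline_latency)
fast_rollouts = rollouts_within_budget(planning_budget, fast_latency)
```

Under these assumed numbers the budget covers 6 baseline rollouts versus 23 accelerated ones. The real gain depends on the baseline and hardware, which the summary does not name.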
Where Pith is reading between the lines
- The two-stage structure could support modular upgrades where dynamics or synthesis components are improved independently.
- Faster inference might enable deployment on resource-constrained robots that previously could not run full world models in real time.
- The disentanglement might extend to other prediction-heavy embodied tasks such as navigation if the same separation holds.
Load-bearing premise
Generating an initial sequence of intermediate visual states from the starting observation and language instruction produces a plausible enough preview of physical interactions that later refinement recovers contact-rich details without critical errors.
What would settle it
A manipulation task where the final refined video predicts contact points or object trajectories that diverge measurably from real execution outcomes after the refinement stage.
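One way to make "diverge measurably" operational is a per-frame distance between predicted and observed object positions, compared against a tolerance. This is a sketch under stated assumptions: the paper does not specify such a test, positions are assumed to be already extracted as 2D points per frame, and `per_frame_error`, `diverges`, and the tolerance values are hypothetical.

```python
import math


def per_frame_error(predicted, observed):
    """Euclidean distance between predicted and observed positions, frame by frame."""
    if len(predicted) != len(observed):
        raise ValueError("trajectories must have the same number of frames")
    return [math.dist(p, o) for p, o in zip(predicted, observed)]


def diverges(predicted, observed, tolerance):
    """True if any frame's error exceeds the tolerance."""
    return max(per_frame_error(predicted, observed)) > tolerance


# Toy trajectories (illustrative placeholders, not measured data).
predicted_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
observed_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.05)]

errors = per_frame_error(predicted_path, observed_path)
flag_strict = diverges(predicted_path, observed_path, tolerance=0.02)
flag_loose = diverges(predicted_path, observed_path, tolerance=0.1)
```

Using the maximum error rather than the mean is a design choice: a single missed contact can invalidate a plan even when the average error looks small.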
Original abstract
Video-based embodied world models provide an appealing substrate for robotic manipulation by predicting future states, yet current approaches remain limited by a fundamental entanglement: accurately modeling dynamics typically requires low-level temporal reasoning, while producing high-resolution frames demands expansive visual synthesis according to high-level semantics. This entanglement results in slow inference speed for iterative planning or too coarse predictions to retain contact-rich details. To solve this dilemma, we present Disentangled Video Generation World Model (DVG-WM), an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, our model first generates a plausible sequence of intermediate visual states to preview the physical interaction and refines them to obtain high-fidelity videos. Furthermore, an efficient cascading mechanism is proposed, where DVG-WM uses flow matching to directly map the dynamics to video latents, and introduces a latent degradation mechanism to regenerate contact-rich details. Experiments on LIBERO and real-world platforms demonstrate improved video quality with up to 3.97 times acceleration, validating that disentangled video generation can be an efficient embodied world model for robotic manipulation.
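The abstract does not say how latent degradation is implemented. A generic illustration, assuming it resembles the familiar pattern of upsampling a coarse latent and blending in Gaussian noise so a refiner has room to regenerate detail, could look like the sketch below. The nearest-neighbour upsampling, the linear blend, the `noise_level` parameter, and all function names are assumptions for illustration, not the authors' mechanism.

```python
import random


def upsample_nearest(grid, factor):
    """Repeat each cell `factor` times along both axes of a 2D grid."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def degrade(grid, noise_level, rng):
    """Blend each cell with unit Gaussian noise: (1 - s) * v + s * eps."""
    return [
        [(1.0 - noise_level) * v + noise_level * rng.gauss(0.0, 1.0) for v in row]
        for row in grid
    ]


# A 2x2 grid standing in for one channel of one low-resolution latent frame.
low_res = [[0.0, 1.0], [2.0, 3.0]]

high_res_init = upsample_nearest(low_res, factor=2)
refiner_input = degrade(high_res_init, noise_level=0.3, rng=random.Random(0))
```

In this toy version `noise_level` trades off how much of the coarse preview is preserved against how much freedom the refiner has; at 0.0 the upsampled latent passes through unchanged.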
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DVG-WM, a framework that disentangles embodied world modeling into a dynamics stage (generating an intermediate sequence of visual states conditioned on initial observation and language instruction) and a visual synthesis stage (refining via flow matching and latent degradation for high-fidelity output). It claims this resolves the entanglement between low-level dynamics and high-resolution synthesis, yielding improved video quality and up to 3.97x acceleration on LIBERO and real-world robotic manipulation platforms.
Significance. If the results hold, the disentanglement and cascading flow-matching mechanism could meaningfully advance efficient video-based world models for robotics by enabling faster planning without sacrificing contact-rich fidelity. The explicit separation of concerns is a clear technical strength.
major comments (1)
- [Experiments] Experiments section: No ablation or direct evaluation is reported comparing intermediate dynamics-preview states against simulator ground truth for object trajectories and contacts, nor measuring downstream planning success rates on LIBERO tasks with versus without the refinement stage. This is load-bearing for the central claim, as refinement can only alter appearance and cannot correct errors in the initial preview.
minor comments (1)
- [Abstract] Abstract: The 3.97x acceleration claim would be strengthened by naming the exact baseline model, hardware, and quantitative video-quality metrics (e.g., FVD or LPIPS) used for comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the experimental validation. We address the major comment below and will strengthen the manuscript accordingly.
Point-by-point responses
- Referee: [Experiments] Experiments section: No ablation or direct evaluation is reported comparing intermediate dynamics-preview states against simulator ground truth for object trajectories and contacts, nor measuring downstream planning success rates on LIBERO tasks with versus without the refinement stage. This is load-bearing for the central claim, as refinement can only alter appearance and cannot correct errors in the initial preview.
  Authors: We agree that direct ablations comparing the intermediate dynamics-preview states to simulator ground truth (object trajectories and contacts) and measuring downstream planning success rates on LIBERO with versus without the refinement stage would provide stronger support for the central disentanglement claim. The current manuscript reports final video quality metrics and inference speed on LIBERO and real-world platforms, which demonstrate overall benefits but do not isolate the dynamics stage accuracy. In the revised version we will add these evaluations, including quantitative comparisons of preview states to ground truth and planning success rates with/without refinement.
  Revision: yes
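The with/without-refinement comparison requested above reduces to a difference of success rates over matched task episodes. A minimal sketch follows, assuming each episode yields a boolean outcome; the outcome lists are illustrative placeholders rather than results from the paper, and the function names are hypothetical.

```python
def success_rate(outcomes):
    """Fraction of episodes that succeeded."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def refinement_gain(with_refinement, without_refinement):
    """Success-rate difference attributable to the refinement stage."""
    return success_rate(with_refinement) - success_rate(without_refinement)


# Placeholder outcomes for four episodes of one task (not real data).
with_refinement = [True, True, False, True]
without_refinement = [True, False, False, True]

gain = refinement_gain(with_refinement, without_refinement)
```

A gain near zero would suggest that planning quality is set by the dynamics preview, in line with the referee's point that refinement cannot correct errors in the preview.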
Circularity Check
No circularity; derivation is architectural proposal plus external experiments
full rationale
The paper introduces DVG-WM as an explicit decomposition of world modeling into dynamics and visual synthesis stages, conditioned on observation and language. Claims of efficiency and quality rest on the proposed cascading mechanism (flow matching + latent degradation) and reported metrics from LIBERO/real-world experiments. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any result to its inputs by construction. The central premise is an architectural choice whose validity is tested externally rather than defined into existence.
Appendix A: Details of Latent Degradation
The refinement stage aims to reconstruct a high-resolution latent video $z_{hr} \in \mathbb{R}^{c \times t \times h_{hr} \times w_{hr}}$ from the low-resolution dynamics produced by the pre...