DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

Xiaofeng Wang; Zheng Zhu; Zhenyu Wu; Ziwei Wang; Ziyu Shan

arxiv: 2606.32028 · v1 · pith:5DZFKRMBnew · submitted 2026-06-30 · 💻 cs.RO

Pith reviewed 2026-07-01 04:52 UTC · model grok-4.3

classification 💻 cs.RO

keywords disentangled video generation · embodied world model · robotic manipulation · flow matching · video prediction · LIBERO benchmark · latent degradation

The pith

Disentangling dynamics learning from visual synthesis in video generation produces faster and higher-quality embodied world models for robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that current video-based world models suffer from an entanglement between low-level dynamics modeling and high-level visual synthesis, which either slows down inference or loses contact-rich details needed for manipulation. By explicitly separating these into a two-stage process—first generating intermediate visual states from an initial observation and language instruction, then refining them via flow matching and latent degradation—the approach aims to deliver both speed and fidelity. A sympathetic reader would care because this could make iterative planning feasible in real robotic systems without sacrificing the accuracy required for physical interactions on benchmarks like LIBERO and real platforms.

Core claim

DVG-WM decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, it first generates a sequence of intermediate visual states that preview the interaction, then refines them into high-fidelity video through an efficient cascade: flow matching maps the dynamics to video latents, and latent degradation regenerates contact-rich details. The paper reports improved video quality and up to 3.97 times acceleration.

What carries the argument

The DVG-WM framework that decomposes world modeling into dynamics learning followed by visual synthesis, with a cascading mechanism that applies flow matching and latent degradation.
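
The cascade can be pictured with a minimal NumPy sketch. Everything below is an illustrative assumption rather than the authors' implementation: the function names (`upsample_nearest`, `degrade`, `euler_flow`, `toy_velocity`), the nearest-neighbour upsampling, the linear noise mixing standing in for "latent degradation", and the toy velocity field are all hypothetical. The sketch assumes the common flow-matching convention in which t = 0 is noise and t = 1 is the clean latent, so a latent degraded with noise strength s re-enters the flow at t = 1 - s.

```python
import numpy as np

def upsample_nearest(z_lr, scale):
    """Nearest-neighbour upsampling of a (c, t, h, w) latent along h and w."""
    return z_lr.repeat(scale, axis=2).repeat(scale, axis=3)

def degrade(z, strength, rng):
    """Blend a latent with Gaussian noise; strength=0 keeps z, strength=1 is pure noise."""
    noise = rng.standard_normal(z.shape)
    return (1.0 - strength) * z + strength * noise

def euler_flow(z_start, velocity_fn, t_start, num_steps):
    """Integrate dz/dt = velocity_fn(z, t) from t_start to t = 1 with Euler steps."""
    z = z_start
    ts = np.linspace(t_start, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        z = z + (t1 - t0) * velocity_fn(z, t0)
    return z

def toy_velocity(z, t):
    # Hypothetical stand-in for a trained velocity network: pulls z toward zero.
    return -z

rng = np.random.default_rng(0)
z_lr = rng.standard_normal((4, 3, 2, 2))   # coarse dynamics latent (c, t, h, w)
strength = 0.3                             # assumed degradation level
z_in = degrade(upsample_nearest(z_lr, 2), strength, rng)
z_hr = euler_flow(z_in, toy_velocity, t_start=1.0 - strength, num_steps=4)
print(z_hr.shape)
```

Starting the integration at t = 1 - s instead of t = 0 is one plausible way a refinement stage could need fewer steps than generation from pure noise, though the abstract does not say this is where the reported speedup comes from.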

If this is right

Inference becomes fast enough for iterative planning loops in robotic manipulation.
Contact-rich details are retained better than in entangled video generation approaches.
The model achieves measurable gains on both the LIBERO benchmark and real-world robot platforms.
The separation allows direct mapping from dynamics to video latents without full low-level temporal reasoning at every frame.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-stage structure could support modular upgrades where dynamics or synthesis components are improved independently.
Faster inference might enable deployment on resource-constrained robots that previously could not run full world models in real time.
The disentanglement might extend to other prediction-heavy embodied tasks such as navigation if the same separation holds.

Load-bearing premise

Generating an initial sequence of intermediate visual states from the starting observation and language instruction produces a plausible enough preview of physical interactions that later refinement recovers contact-rich details without critical errors.

What would settle it

A manipulation task where the final refined video predicts contact points or object trajectories that diverge measurably from real execution outcomes after the refinement stage.
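
One way to make "diverge measurably" concrete is a per-step distance between a trajectory read off the predicted video and the one observed during real execution. The sketch below is hypothetical: the function names, the `(T, D)` array layout, and the tolerance are assumptions, and it presumes object or contact positions have already been extracted from both sources by some separate tracking step that the page does not describe.

```python
import numpy as np

def trajectory_divergence(predicted, executed):
    """Per-step Euclidean distance between two (T, D) trajectories."""
    predicted = np.asarray(predicted, dtype=float)
    executed = np.asarray(executed, dtype=float)
    if predicted.shape != executed.shape:
        raise ValueError("trajectories must have the same shape")
    return np.linalg.norm(predicted - executed, axis=-1)

def diverges(predicted, executed, tolerance):
    """True if any step is farther apart than tolerance (same units as the inputs)."""
    return bool(np.any(trajectory_divergence(predicted, executed) > tolerance))
```

A falsifying case would then be a task where `diverges(...)` is true at a tolerance tight enough to matter for grasping or contact, even though the refined video looks visually plausible.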

Original abstract

Video-based embodied world models provide an appealing substrate for robotic manipulation by predicting future states, yet current approaches remain limited by a fundamental entanglement: accurately modeling dynamics typically requires low-level temporal reasoning, while producing high-resolution frames demands expansive visual synthesis according to high-level semantics. This entanglement results in slow inference speed for iterative planning or too coarse predictions to retain contact-rich details. To solve this dilemma, we present Disentangled Video Generation World Model (DVG-WM), an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, our model first generates a plausible sequence of intermediate visual states to preview the physical interaction and refines them to obtain high-fidelity videos. Furthermore, an efficient cascading mechanism is proposed, where DVG-WM uses flow matching to directly map the dynamics to video latents, and introduces a latent degradation mechanism to regenerate contact-rich details. Experiments on LIBERO and real-world platforms demonstrate improved video quality with up to 3.97 times acceleration, validating that disentangled video generation can be an efficient embodied world model for robotic manipulation.
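
For readers unfamiliar with flow matching, the standard linear-path objective from the flow matching and rectified flow literature is given here as background. The abstract does not state the exact parameterization DVG-WM uses, so the symbols are assumptions chosen for illustration: $x_0$ is a source sample, $x_1$ a target video latent, $c$ the conditioning (observation and instruction), and $v_\theta$ the learned velocity field.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1] \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2 \\
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t, c)
  \quad \text{(integrated from the source toward } t = 1 \text{ at inference)}
\end{align}
```

Because the interpolation path is a straight line, the regression target $x_1 - x_0$ is constant along it, which is why such models can often be sampled with few integration steps.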

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The disentangled dynamics-then-refinement setup is the real contribution here, but it rests on an untested assumption that the coarse preview gets contacts and trajectories right enough for the visual stage to polish without fixing errors.

The paper's core move is to split world modeling into a first stage that generates a sequence of intermediate visual states from the starting image and language instruction, then a second stage that refines those states into high-fidelity video using flow matching to map dynamics directly to latents plus a latent degradation step for contact details. This is positioned as solving the speed-versus-quality tradeoff that has limited prior video-based embodied models for planning.

What stands out as new is the explicit two-stage decomposition combined with the cascading flow-matching mechanism. The authors show this yields up to 3.97 times faster inference while reporting better video quality on LIBERO tasks and real-robot hardware. That architectural separation is a clean way to allocate compute, and the choice of flow matching for the dynamics-to-latent step is a reasonable efficiency play.

The soft spot is exactly the one flagged in the stress-test note. If the initial dynamics preview does not already encode accurate object trajectories and contacts, the refinement stage can only change appearance; it cannot correct the underlying physics. The abstract claims positive results on standard benchmarks but gives no ablations that isolate the refinement stage's effect on downstream planning success, nor any direct comparison of the intermediate states against simulator ground truth. Without those checks, the efficiency gains could come from reduced computation at the price of hidden errors in manipulation-relevant details.

This work is aimed at robotics researchers building world models for iterative planning who need faster inference without losing too much fidelity. A reader working on similar video prediction pipelines would find the framework worth examining for the decomposition idea alone.

I would send it to peer review. The idea targets a genuine practical bottleneck and the experiments use relevant benchmarks, so referees can pressure-test the missing ablations and metrics.

Referee Report

1 major / 1 minor

Summary. The paper proposes DVG-WM, a framework that disentangles embodied world modeling into a dynamics stage (generating an intermediate sequence of visual states conditioned on initial observation and language instruction) and a visual synthesis stage (refining via flow matching and latent degradation for high-fidelity output). It claims this resolves the entanglement between low-level dynamics and high-resolution synthesis, yielding improved video quality and up to 3.97x acceleration on LIBERO and real-world robotic manipulation platforms.

Significance. If the results hold, the disentanglement and cascading flow-matching mechanism could meaningfully advance efficient video-based world models for robotics by enabling faster planning without sacrificing contact-rich fidelity. The explicit separation of concerns is a clear technical strength.

major comments (1)

[Experiments] Experiments section: No ablation or direct evaluation is reported comparing intermediate dynamics-preview states against simulator ground truth for object trajectories and contacts, nor measuring downstream planning success rates on LIBERO tasks with versus without the refinement stage. This is load-bearing for the central claim, as refinement can only alter appearance and cannot correct errors in the initial preview.

minor comments (1)

[Abstract] Abstract: The 3.97x acceleration claim would be strengthened by naming the exact baseline model, hardware, and quantitative video-quality metrics (e.g., FVD or LPIPS) used for comparison.
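
A speedup figure of this kind is only reproducible when the timing protocol is stated alongside it. The sketch below shows one simple protocol, under stated assumptions: `median_latency` and `speedup` are hypothetical helper names, the callables being timed are placeholders for a baseline and a candidate model's inference call, and on a GPU each callable would also need to wait for queued work to finish before the timer stops.

```python
import statistics
import time

def median_latency(fn, repeats=5):
    """Median wall-clock time in seconds of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline latency to candidate latency; above 1 means the candidate is faster."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds
```

Reporting the baseline name, hardware, number of repeats, and the quality metric measured on the same runs would let a reader check that speed was not bought with lower fidelity.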

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the experimental validation. We address the major comment below and will strengthen the manuscript accordingly.

Referee: [Experiments] Experiments section: No ablation or direct evaluation is reported comparing intermediate dynamics-preview states against simulator ground truth for object trajectories and contacts, nor measuring downstream planning success rates on LIBERO tasks with versus without the refinement stage. This is load-bearing for the central claim, as refinement can only alter appearance and cannot correct errors in the initial preview.

Authors: We agree that direct ablations comparing the intermediate dynamics-preview states to simulator ground truth (object trajectories and contacts) and measuring downstream planning success rates on LIBERO with versus without the refinement stage would provide stronger support for the central disentanglement claim. The current manuscript reports final video quality metrics and inference speed on LIBERO and real-world platforms, which demonstrate overall benefits but do not isolate the accuracy of the dynamics stage. In the revised version we will add these evaluations, including quantitative comparisons of preview states to ground truth and planning success rates with and without refinement.

revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is architectural proposal plus external experiments

The paper introduces DVG-WM as an explicit decomposition of world modeling into dynamics and visual synthesis stages, conditioned on observation and language. Claims of efficiency and quality rest on the proposed cascading mechanism (flow matching + latent degradation) and reported metrics from LIBERO/real-world experiments. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce any result to its inputs by construction. The central premise is an architectural choice whose validity is tested externally rather than defined into existence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5743 in / 1104 out tokens · 33816 ms · 2026-07-01T04:52:12.565285+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 39 canonical work pages · 22 internal anchors

[1] Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

[2] Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

[3] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

[4] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog 1(8), 1 (2024)

[5] Cen, J., Huang, S., Yuan, Y., Li, K., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, X., Luo, H., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)

[6] Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025)

[7] Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024)

[8] Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840 (2025)

[9] Chen, Y., Li, P., Yang, J., He, K., Wu, X., Xu, Y., Wang, K., Liu, J., Liu, N., Huang, Y., et al.: Bridgev2w: Bridging video generation models to embodied world models via embodiment masks. arXiv preprint arXiv:2602.03793 (2026)

[10] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)

[11] Chi, X., Jia, P., Fan, C.K., Ju, X., Mi, W., Zhang, K., Qin, Z., Tian, W., Ge, K., Li, H., et al.: Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642 (2025)

[12] Deng, H., Wu, Z., Liu, H., Guo, W., Xue, Y., Shan, Z., Zhang, C., Jia, B., Ling, Y., Lu, G., et al.: A survey on reinforcement learning of vision-language-action models for robotic manipulation. Authorea Preprints (2025)

[13] Guo, Y., Shi, L.X., Chen, J., Finn, C.: Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125 (2025)

[14] He, J., Xue, T., Liu, D., Lin, X., Gao, P., Lin, D., Qiao, Y., Ouyang, W., Liu, Z.: Venhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667 (2024)

[15] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

[16] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)

[17] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[18] Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al.: Enerverse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895 (2025)

[19] Huang, W., Chao, Y.W., Mousavian, A., Liu, M.Y., Fox, D., Mo, K., Fei-Fei, L.: Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782 (2026)

[20] Hung, C.Y., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al.: Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854 (2025)

[21] Jiang, Z., Liu, K., Qin, Y., Tian, S., Zheng, Y., Zhou, M., Yu, C., Li, H., Zhao, D.: World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080 (2025)

[22] Kim, M.J., Gao, Y., Lin, T.Y., Lin, Y.C., Ge, Y., Lam, G., Liang, P., Song, S., Liu, M.Y., Finn, C., et al.: Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163 (2026)

[23] Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., et al.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026)

[24] Li, S., Hao, Q., Shang, Y., Li, Y.: Keyworld: Key frame reasoning enables effective and efficient world models. arXiv preprint arXiv:2509.21027 (2025)

[25] Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)

[26] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

[27] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

[28] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

[29] Lu, G., Jia, B., Li, P., Chen, Y., Wang, Z., Tang, Y., Huang, S.: Gwm: Towards scalable gaussian world models for robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9263–9274 (2025)

[30] Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manipulation. arXiv preprint arXiv:2510.07313 (2025)

[31] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

[32] Ruhe, D., Heek, J., Salimans, T., Hoogeboom, E.: Rolling diffusion models. arXiv preprint arXiv:2402.09470 (2024)

[33] Shan, Z., Zhang, Y., Yang, Q., Yang, H., Xu, Y., Hwang, J.N., Xu, X., Liu, S.: Contrastive pre-training with multi-view fusion for no-reference point cloud quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25942–25951 (2024)

[34] Shan, Z., Zhou, Y., Wu, G., Ji, Z., Wu, Z., Wang, Z.: Dockanywhere: Data-efficient visuomotor policy learning for mobile manipulation via novel demonstration generation. IEEE Robotics and Automation Letters (2026)

[35] Shang, Y., Jin, L., Ma, Y., Zhang, X., Gao, C., Wu, W., Li, Y.: Longscape: Advancing long-horizon embodied world models with context-aware moe. arXiv preprint arXiv:2509.21790 (2025)

[36] Shen, Y., Wei, F., Du, Z., Liang, Y., Lu, Y., Yang, J., Zheng, N., Guo, B.: Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963 (2025)

[37] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

[38] Team, G.R., Choromanski, K., Devin, C., Du, Y., Dwibedi, D., Gao, R., Jindal, A., Kipf, T., Kirmani, S., Leal, I., et al.: Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675 (2025)

[39] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[40] Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

[41] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

[42] Wu, J., Yin, S., Feng, N., He, X., Li, D., Hao, J., Long, M.: ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 37, 68082–68119 (2024)

[43] Xiao, J., Yang, Y., Chang, X., Chen, R., Xiong, F., Xu, M., Zheng, W.S., Zhang, Q.: World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948 (2025)

[44] Xie, R., Liu, Y., Zhou, P., Zhao, C., Zhou, J., Zhang, K., Zhang, Z., Yang, J., Yang, Z., Tai, Y.: Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17108–17118 (2025)

[45] Yang, J., Lin, K., Li, J., Zhang, W., Lin, T., Wu, L., Su, Z., Zhao, H., Zhang, Y.Q., Chen, L., et al.: Rise: Self-improving robot policy with compositional world model. arXiv preprint arXiv:2602.11075 (2026)

[46] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[47] Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., et al.: World action models are zero-shot policies. arXiv preprint arXiv:2602.15922 (2026)

[48] Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954 (2024)

[49] Zhang, C., Wu, Z., Lu, G., Tang, Y., Wang, Z.: imowm: Taming interactive multi-modal world model for robotic manipulation. arXiv preprint arXiv:2510.09036 (2025) [work page title: RoDyn: Taming Interactive Robot-Dynamic 2.5D World Model for Robotic Manipulation]

[50] Zhang, P.F., Cheng, Y., Sun, X., Wang, S., Li, F., Zhu, L., Shen, H.T.: A step toward world models: A survey on robotic manipulation. arXiv preprint arXiv:2511.02097 (2025)

[51] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

[52] Zhang, S., Li, W., Chen, S., Ge, C., Sun, P., Zhang, Y., Jiang, Y., Yuan, Z., Peng, B., Luo, P.: Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation. arXiv preprint arXiv:2502.05179 (2025)

[53] Zhang, W., Hu, T., Zhang, H., Qiao, Y., Qin, Y., Li, Y., Liu, J., Kong, T., Liu, L., Ma, X.: Chain-of-action: Trajectory autoregressive modeling for robotic manipulation. arXiv preprint arXiv:2506.09990 (2025)

[54] Zhang, Y., Yang, Q., Shan, Z., Xu, Y.: Asynchronous feedback network for perceptual point cloud quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (2024)

[55] Zhou, P., Chen, L., Chen, S., Chen, D., Zhao, W., Jin, R., Ren, G., Luo, J.: Act2goal: From world model to general goal-conditioned policy. arXiv preprint arXiv:2512.23541 (2025)

[56] Zhou, S., Yang, P., Wang, J., Luo, Y., Loy, C.C.: Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2535–2545 (2024)

[57] Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: Irasim: A fine-grained world model for robot manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9834–9844 (2025)

[58] Zhu, F., Yan, Z., Hong, Z., Shou, Q., Ma, X., Guo, S.: Wmpo: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515 (2025)

Appendix A Details of Latent Degradation

The refinement stage aims to reconstruct a high-resolution latent video $z_{hr} \in \mathbb{R}^{c \times t \times h_{hr} \times w_{hr}}$ from the low-resolution dynamics produced by the pre...

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Motus: A Unified Latent Action World Model

Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

OpenAI Blog1(8), 1 (2024)

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., et al.: Video generation models as world simulators. OpenAI Blog1(8), 1 (2024)

2024

[5] [5]

RynnVLA-002: A Unified Vision-Language-Action and World Model

Cen, J., Huang, S., Yuan, Y., Li, K., Yuan, H., Yu, C., Jiang, Y., Guo, J., Li, X., Luo, H., et al.: Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

WorldVLA: Towards Autoregressive Action World Model

Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Advances in Neural Information Processing Systems37, 24081–24125 (2024)

Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems37, 24081–24125 (2024)

2024

[8] Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840 (2025)

[9] Chen, Y., Li, P., Yang, J., He, K., Wu, X., Xu, Y., Wang, K., Liu, J., Liu, N., Huang, Y., et al.: Bridgev2w: Bridging video generation models to embodied world models via embodiment masks. arXiv preprint arXiv:2602.03793 (2026)

[10] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)

[11] Chi, X., Jia, P., Fan, C.K., Ju, X., Mi, W., Zhang, K., Qin, Z., Tian, W., Ge, K., Li, H., et al.: Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642 (2025)

[12] Deng, H., Wu, Z., Liu, H., Guo, W., Xue, Y., Shan, Z., Zhang, C., Jia, B., Ling, Y., Lu, G., et al.: A survey on reinforcement learning of vision-language-action models for robotic manipulation. Authorea Preprints (2025)

[13] Guo, Y., Shi, L.X., Chen, J., Finn, C.: Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125 (2025)

[14] He, J., Xue, T., Liu, D., Lin, X., Gao, P., Lin, D., Qiao, Y., Ouyang, W., Liu, Z.: Venhancer: Generative space-time enhancement for video generation. arXiv preprint arXiv:2407.07667 (2024)

[15] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)

[16] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)

[17] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[18] Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al.: Enerverse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895 (2025)

[19] Huang, W., Chao, Y.W., Mousavian, A., Liu, M.Y., Fox, D., Mo, K., Fei-Fei, L.: Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782 (2026)

[20] Hung, C.Y., Sun, Q., Hong, P., Zadeh, A., Li, C., Tan, U., Majumder, N., Poria, S., et al.: Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854 (2025)

[21] Jiang, Z., Liu, K., Qin, Y., Tian, S., Zheng, Y., Zhou, M., Yu, C., Li, H., Zhao, D.: World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation. arXiv preprint arXiv:2509.19080 (2025)

[22] Kim, M.J., Gao, Y., Lin, T.Y., Lin, Y.C., Ge, Y., Lam, G., Liang, P., Song, S., Liu, M.Y., Finn, C., et al.: Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163 (2026)

[23] Li, L., Zhang, Q., Luo, Y., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., et al.: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998 (2026)

[24] Li, S., Hao, Q., Shang, Y., Li, Y.: Keyworld: Key frame reasoning enables effective and efficient world models. arXiv preprint arXiv:2509.21027 (2025)

[25] Liao, Y., Zhou, P., Huang, S., Yang, D., Chen, S., Jiang, Y., Hu, Y., Cai, J., Liu, S., Luo, J., et al.: Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)

[26] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

[27] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

[28] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

[29] Lu, G., Jia, B., Li, P., Chen, Y., Wang, Z., Tang, Y., Huang, S.: Gwm: Towards scalable gaussian world models for robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9263–9274 (2025)

[30] Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., Zhang, S.: Wristworld: Generating wrist-views via 4d world models for robotic manipulation. arXiv preprint arXiv:2510.07313 (2025)

[31] Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al.: Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 (2024)

[32] Ruhe, D., Heek, J., Salimans, T., Hoogeboom, E.: Rolling diffusion models. arXiv preprint arXiv:2402.09470 (2024)

[33] Shan, Z., Zhang, Y., Yang, Q., Yang, H., Xu, Y., Hwang, J.N., Xu, X., Liu, S.: Contrastive pre-training with multi-view fusion for no-reference point cloud quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25942–25951 (2024)

[34] Shan, Z., Zhou, Y., Wu, G., Ji, Z., Wu, Z., Wang, Z.: Dockanywhere: Data-efficient visuomotor policy learning for mobile manipulation via novel demonstration generation. IEEE Robotics and Automation Letters (2026)

[35] Shang, Y., Jin, L., Ma, Y., Zhang, X., Gao, C., Wu, W., Li, Y.: Longscape: Advancing long-horizon embodied world models with context-aware moe. arXiv preprint arXiv:2509.21790 (2025)

[36] Shen, Y., Wei, F., Du, Z., Liang, Y., Lu, Y., Yang, J., Zheng, N., Guo, B.: Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963 (2025)

[37] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

[38] Team, G.R., Choromanski, K., Devin, C., Du, Y., Dwibedi, D., Gao, R., Jindal, A., Kipf, T., Kirmani, S., Leal, I., et al.: Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675 (2025)

[39] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[40] Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

[41] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

[42] Wu, J., Yin, S., Feng, N., He, X., Li, D., Hao, J., Long, M.: ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 37, 68082–68119 (2024)

[43] Xiao, J., Yang, Y., Chang, X., Chen, R., Xiong, F., Xu, M., Zheng, W.S., Zhang, Q.: World-env: Leveraging world model as a virtual environment for vla post-training. arXiv preprint arXiv:2509.24948 (2025)

[44] Xie, R., Liu, Y., Zhou, P., Zhao, C., Zhou, J., Zhang, K., Zhang, Z., Yang, J., Yang, Z., Tai, Y.: Star: Spatial-temporal augmentation with text-to-video models for real-world video super-resolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17108–17118 (2025)

[45] Yang, J., Lin, K., Li, J., Zhang, W., Lin, T., Wu, L., Su, Z., Zhao, H., Zhang, Y.Q., Chen, L., et al.: Rise: Self-improving robot policy with compositional world model. arXiv preprint arXiv:2602.11075 (2026)

[46] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)

[47] Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., et al.: World action models are zero-shot policies. arXiv preprint arXiv:2602.15922 (2026)

[48] Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954 (2024)

[49] Zhang, C., Wu, Z., Lu, G., Tang, Y., Wang, Z.: imowm: Taming interactive multi-modal world model for robotic manipulation. arXiv preprint arXiv:2510.09036 (2025)

[50] Zhang, P.F., Cheng, Y., Sun, X., Wang, S., Li, F., Zhu, L., Shen, H.T.: A step toward world models: A survey on robotic manipulation. arXiv preprint arXiv:2511.02097 (2025)

[51] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

[52] Zhang, S., Li, W., Chen, S., Ge, C., Sun, P., Zhang, Y., Jiang, Y., Yuan, Z., Peng, B., Luo, P.: Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation. arXiv preprint arXiv:2502.05179 (2025)

[53] Zhang, W., Hu, T., Zhang, H., Qiao, Y., Qin, Y., Li, Y., Liu, J., Kong, T., Liu, L., Ma, X.: Chain-of-action: Trajectory autoregressive modeling for robotic manipulation. arXiv preprint arXiv:2506.09990 (2025)

[54] Zhang, Y., Yang, Q., Shan, Z., Xu, Y.: Asynchronous feedback network for perceptual point cloud quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (2024)

[55] Zhou, P., Chen, L., Chen, S., Chen, D., Zhao, W., Jin, R., Ren, G., Luo, J.: Act2goal: From world model to general goal-conditioned policy. arXiv preprint arXiv:2512.23541 (2025)

[56] Zhou, S., Yang, P., Wang, J., Luo, Y., Loy, C.C.: Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2535–2545 (2024)

[57] Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: Irasim: A fine-grained world model for robot manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9834–9844 (2025)

[58] Zhu, F., Yan, Z., Hong, Z., Shou, Q., Ma, X., Guo, S.: Wmpo: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515 (2025)

Appendix A Details of Latent Degradation

The refinement stage aims to reconstruct a high-resolution latent video $z_{hr} \in \mathbb{R}^{c \times t \times h_{hr} \times w_{hr}}$ from the low-resolution dynamics produced by the pre...