Beyond the Next Step: Variable-Length Latent World Models for Long-Horizon Planning

Qi Zhang; Tianqi Du; Yifei Wang; Yisen Wang

arxiv: 2606.21775 · v1 · pith:V33JQVPXnew · submitted 2026-06-19 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 14:20 UTC · model grok-4.3

keywords: latent world models, variable-length prediction, long-horizon planning, curriculum training, action-conditioned dynamics, JEPA-style models, reinforcement learning

The pith

VLWMs train latent predictors on variable-length action sequences to avoid recursive rollout errors in long-horizon planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current latent world models predict only the next state and must be unrolled repeatedly for planning, which accumulates errors and misaligns training with actual use. VLWMs instead learn a single predictor that maps action sequences of any chosen length directly to the resulting latent state. Training uses a curriculum that starts with short sequences and lengthens them over time to reach stable long-range forecasts. At inference, planning routines query the model at multiple horizons to score candidate action sequences. Experiments report a 13 percent average gain over prior one-step latent models, with the largest improvements on tasks that require extended foresight.

Core claim

The central claim is that conditioning latent dynamics directly on variable-length action sequences, rather than on single actions, lets the same model serve planning at different horizons and removes the need for recursive one-step rollout, with curriculum training enabling stable optimization from short to long sequences.

What carries the argument

A latent predictor that accepts action sequences of arbitrary length and outputs the future state after that sequence, trained via progressive horizon expansion.
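A predictor that accepts action sequences of arbitrary length needs some way to batch sequences of different lengths. The sketch below shows one common approach, padding plus a validity mask. It is an assumption for illustration only: the source does not say how VLWMs batch their action tokens, and `pad_action_segments` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

Action = Sequence[float]


def pad_action_segments(
    segments: Sequence[Sequence[Action]], pad_value: float = 0.0
) -> Tuple[List[List[List[float]]], List[List[int]]]:
    """Pad variable-length action segments to a common length.

    Returns (padded, mask), where mask[i][j] is 1 for a real action and 0 for
    padding. Hypothetical helper, not taken from the paper.
    """
    if not segments or any(len(seg) == 0 for seg in segments):
        raise ValueError("every segment needs at least one action")
    action_dim = len(segments[0][0])
    max_len = max(len(seg) for seg in segments)
    padded, mask = [], []
    for seg in segments:
        n_pad = max_len - len(seg)
        # Real actions first, then padding rows of the same dimensionality.
        padded.append(
            [list(a) for a in seg] + [[pad_value] * action_dim for _ in range(n_pad)]
        )
        mask.append([1] * len(seg) + [0] * n_pad)
    return padded, mask


# Two segments of length 1 and 3, each action two-dimensional.
batch, valid = pad_action_segments(
    [[(1.0, 0.0)], [(0.5, 0.5), (0.2, 0.1), (0.0, 1.0)]]
)
```

A sequence model consuming `batch` would use `valid` to ignore padded positions, so a single forward pass can serve several horizons at once.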

If this is right

One network can now evaluate both short and long action plans without separate predictors or repeated rollouts.
Curriculum lengthening of action horizons stabilizes learning of extended dynamics.
Planning methods can exploit queries at multiple horizons to improve decision quality.
Gains appear largest on control tasks that demand extended planning horizons.
Average performance rises 13 percent over the prior LeWM baseline across tested datasets.
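The curriculum point above can be made concrete with a small sketch. The source says only that the action horizon is expanded progressively. The linear ramp, the default bounds, and the names `max_horizon_at` and `sample_horizon` are all assumptions for illustration, not the authors' schedule.

```python
import random


def max_horizon_at(
    step: int,
    total_steps: int,
    k_min: int = 1,
    k_max: int = 16,
    ramp_frac: float = 0.5,
) -> int:
    """Largest horizon allowed at a given training step (assumed linear ramp)."""
    ramp_steps = max(1, int(total_steps * ramp_frac))
    progress = min(1.0, step / ramp_steps)
    # Floor so the horizon only grows once the ramp has fully earned it.
    return k_min + int(progress * (k_max - k_min))


def sample_horizon(step: int, total_steps: int, rng: random.Random) -> int:
    """Sample a training horizon k uniformly from the currently allowed range."""
    return rng.randint(1, max_horizon_at(step, total_steps))


rng = random.Random(0)
early = sample_horizon(step=0, total_steps=1000, rng=rng)    # only k = 1 is allowed
late = sample_horizon(step=900, total_steps=1000, rng=rng)   # any k in 1..16
```

Sampling from the whole allowed range, rather than only the current maximum, keeps short horizons in the training mix after the ramp ends. Whether the paper does this is not stated.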

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The variable-length conditioning idea could extend to other sequence prediction domains such as video or language modeling.
Search algorithms might learn to select the most useful horizon length on the fly during planning.
Model-based reinforcement learning policies could be trained to output variable-length action chunks rather than single steps.
Real-world robot experiments would test whether curriculum-trained VLWMs transfer when dynamics contain unmodeled noise.

Load-bearing premise

Training directly on variable-length sequences yields stable long-range predictions without introducing new optimization failures or distribution mismatches that one-step models avoid.

What would settle it

A head-to-head test in which a one-step model with near-zero short-term error is unrolled to the same long horizon as a VLWM and the two are compared on final state accuracy.
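The one-step side of that comparison can be illustrated on a toy system. In the sketch below, the scalar dynamics and the 1% gain error are invented for illustration. It shows how a model with small one-step error can drift when unrolled, and it says nothing about how a VLWM's direct prediction error behaves, which is what the proposed test would measure.

```python
from typing import Callable, Sequence


def true_step(z: float, u: float) -> float:
    """Toy ground-truth dynamics: a scalar integrator."""
    return z + u


def learned_step(z: float, u: float, gain_error: float = 0.01) -> float:
    """Toy one-step model whose gain is off by a small assumed amount."""
    return (1.0 + gain_error) * z + u


def rollout(step_fn: Callable[[float, float], float], z0: float,
            actions: Sequence[float]) -> float:
    """Apply a one-step function recursively along an action sequence."""
    z = z0
    for u in actions:
        z = step_fn(z, u)
    return z


def recursive_error(z0: float, actions: Sequence[float]) -> float:
    """Absolute final-state error of the recursively unrolled toy model."""
    return abs(rollout(learned_step, z0, actions) - rollout(true_step, z0, actions))


# Error of the unrolled model at the horizons shown in Figure 3.
errors = {k: recursive_error(1.0, [0.0] * k) for k in (1, 25, 50, 75, 100)}
```

In this toy the error grows geometrically with the horizon, which is the compounding effect the paper attributes to recursive one-step rollout.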

Figures

Figures reproduced from arXiv: 2606.21775 by Qi Zhang, Tianqi Du, Yifei Wang, Yisen Wang.

**Figure 1.** Overview of Variable-length Latent World Models (VLWM). Compared to a one-step world model (LeWM, top), VLWM (bottom) trains a single predictor to forecast the latent at an arbitrary offset t→t+k by feeding a variable-length segment of action tokens, and at planning time replaces the long autoregressive rollout with a few chunked variable-length latent jumps, mitigating compounding error. Despite their suc…

**Figure 2.** Goal-conditioned planning success rate (…

**Figure 3.** Visualization of rollouts on TwoRoom. Given the same input frame at T=0, we compare VLWM’s chunked variable-length rollout (middle) against LeWM’s recursive one-step rollout (bottom) at T ∈ {25, 50, 75, 100}, with the ground-truth trajectory shown on top. VLWM stays close to the real trajectory and successfully crosses the door into the second room, while LeWM accumulates error and remains stuck in the st…

Original abstract

Recently, world models have emerged as a promising paradigm for building intelligent agents by learning predictive models that estimate future environment states conditioned on observations and actions. In particular, JEPA-style latent world models provide an efficient alternative to pixel-space prediction by learning action-conditioned dynamics in compact representation spaces. However, existing latent world models typically rely on one-step prediction and must be recursively rolled out for long-horizon planning, which leads to compounding errors and a mismatch between training objectives and downstream planning tasks. To address this limitation, we propose Variable-length Latent World Models (VLWMs), a framework that learns to predict future latent states conditioned on action sequences of variable lengths. Instead of training only on one-step transitions, VLWMs directly model temporally extended dynamics, allowing the same predictor to evaluate action plans over different horizons. We further introduce a curriculum training strategy that progressively expands the action horizon, stabilizing optimization from short-range dynamics to long-range prediction. At test time, we design planning methods tailored to VLWMs to better exploit their variable-length predictive capabilities. Experiments on long-horizon control tasks show that VLWMs significantly improve latent-space world models, achieving 13% average improvement over the state-of-the-art LeWM across different datasets, with especially large gains on tasks requiring extended planning. These results suggest that VLWM provides a simple yet effective paradigm for improving long-horizon prediction and planning in latent world models.
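The Figure 1 caption describes planning with "a few chunked variable-length latent jumps" in place of a long autoregressive rollout. A minimal sketch of that idea follows, assuming a predictor with the hypothetical signature `predict(z, actions)` and a hypothetical cap `max_chunk` on the jump length. The toy predictor is a stand-in, not the paper's model.

```python
from typing import Callable, List, Sequence


def chunk_actions(actions: Sequence[float], max_chunk: int) -> List[List[float]]:
    """Split an action sequence into consecutive chunks of at most max_chunk."""
    if max_chunk < 1:
        raise ValueError("max_chunk must be at least 1")
    return [list(actions[i:i + max_chunk]) for i in range(0, len(actions), max_chunk)]


def chunked_rollout(
    predict: Callable[[float, Sequence[float]], float],
    z0: float,
    actions: Sequence[float],
    max_chunk: int,
) -> float:
    """Advance the latent with one predictor call per chunk, not per action."""
    z = z0
    for chunk in chunk_actions(actions, max_chunk):
        z = predict(z, chunk)
    return z


def toy_predict(z: float, chunk: Sequence[float]) -> float:
    """Stand-in for a variable-length predictor: latent plus summed actions."""
    return z + sum(chunk)


# A 10-step plan evaluated with 3 predictor calls (chunks of 4, 4, 2)
# rather than 10 recursive one-step calls.
z_final = chunked_rollout(toy_predict, 0.0, [1.0] * 10, max_chunk=4)
```

Fewer predictor calls means fewer points at which prediction error can feed back into the next input. The chunk size presumably trades against how accurate each long jump is, a trade-off the abstract does not quantify.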

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Variable-length training with curriculum is a reasonable idea but the 13% gain is likely inflated by changes in test-time planning rather than the training shift itself.

The main takeaway is that this paper trains latent world models to handle action sequences of varying lengths instead of always doing one-step prediction, with a curriculum that ramps up the horizon. That setup tries to close the gap between training and long-horizon use.

What stands out as new is the direct modeling of multi-step dynamics in latent space plus the curriculum schedule. It builds on the JEPA-style baselines they cite and gives a concrete way to avoid recursive rollout during training.

The paper does a clear job naming the compounding error issue and why one-step models create a train-test mismatch. The curriculum is a simple, practical addition that could help optimization.

The soft spot is the results. The abstract notes they use planning methods tailored to VLWMs, which suggests the baseline LeWM was probably run with its original one-step planner while VLWM got a planner that directly uses variable lengths. Without an ablation that holds the planner fixed, the 13% average improvement cannot be cleanly credited to the training change. There are also no error bars, seed counts, or dataset details, so the number is hard to evaluate.

This is for people working on model-based RL and long-horizon planning in latent spaces. A reader who wants to test whether variable-length prediction helps would find the core idea useful, but they would need to reimplement and control the planner themselves.

It deserves peer review because the problem is real and the proposed fix is straightforward, even though the current evidence needs tighter controls to be convincing.

Referee Report

1 major / 2 minor

Summary. The paper proposes Variable-length Latent World Models (VLWMs) that directly predict future latent states conditioned on action sequences of variable lengths rather than one-step transitions. It introduces a curriculum that progressively increases the action horizon during training and designs planning methods at test time that exploit the variable-length predictions. Experiments on long-horizon control tasks report a 13% average improvement over the state-of-the-art LeWM, with larger gains on tasks requiring extended planning.

Significance. If the reported gains can be attributed to the variable-length training objective rather than differences in test-time planning, the approach would offer a direct way to reduce the train-test mismatch that arises when one-step latent predictors are rolled out recursively for long horizons. The curriculum strategy and the ability to evaluate plans at multiple horizons in a single forward pass are potentially useful for latent world models in general.

major comments (1)

[Abstract] The central empirical claim is a 13% average improvement over LeWM, yet the abstract also states that the test-time planning methods are 'tailored to VLWMs to better exploit their variable-length predictive capabilities.' No ablation is described that holds the planning procedure fixed while varying only the training regime (one-step vs. variable-length). Because the abstract explicitly introduces planner changes at test time, the performance delta cannot yet be attributed to the training procedure itself; this directly affects the soundness of the main claim.

minor comments (2)

The abstract and any experimental section should report error bars, number of seeds, dataset splits, and exact baseline implementations (including whether LeWM was re-evaluated under the new planner).
Notation for the variable-length predictor and the curriculum schedule should be defined with explicit equations showing how the loss is computed for sequences of different lengths.
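As an illustration of what the requested notation might look like, the block below writes a variable-length prediction loss under a curriculum cap. Every symbol is an assumption introduced here, not the paper's notation: $z_t$ is the latent at time $t$, $a_{t:t+k-1}$ the $k$ actions taken from step $t$, $f_\theta$ the predictor, and $K_s$ the largest horizon allowed at training step $s$. Any regularizer the authors use alongside the prediction term is omitted.

```latex
% Assumed notation for illustration only; not taken from the paper.
\hat{z}_{t+k} = f_\theta\bigl(z_t,\; a_{t:t+k-1}\bigr), \qquad 1 \le k \le K_s

\mathcal{L}_s(\theta) =
  \mathbb{E}_{t}\;
  \mathbb{E}_{k \sim \mathrm{Unif}\{1,\dots,K_s\}}
  \Bigl\| f_\theta\bigl(z_t,\; a_{t:t+k-1}\bigr) - z_{t+k} \Bigr\|_2^2
```

Under this form, fixing $K_s = 1$ for all $s$ recovers a one-step objective, and the curriculum corresponds to letting $K_s$ grow with $s$.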

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need to clarify the source of the reported performance gains. We address the major comment below and commit to revisions that strengthen the attribution of results.

Referee: [Abstract] The central empirical claim is a 13% average improvement over LeWM, yet the abstract also states that the test-time planning methods are 'tailored to VLWMs to better exploit their variable-length predictive capabilities.' No ablation is described that holds the planning procedure fixed while varying only the training regime (one-step vs. variable-length). Because the abstract explicitly introduces planner changes at test time, the performance delta cannot yet be attributed to the training procedure itself; this directly affects the soundness of the main claim.

Authors: We agree that the current presentation does not fully isolate the contribution of the variable-length training objective from the test-time planning adaptations. The core technical contribution is the variable-length prediction model trained with curriculum, which directly addresses the train-test mismatch of recursive one-step rollouts. The tailored planning methods (e.g., multi-horizon evaluation in a single forward pass) are enabled by this model and form an integral part of the proposed framework. Nevertheless, to strengthen the claim, we will add an ablation in the revised manuscript that applies an identical fixed planning procedure (standard model-predictive control with fixed horizon) to both the baseline one-step LeWM and the VLWM. We will also revise the abstract to state that the 13% gain is achieved by the full VLWM approach (variable-length training plus its associated planning methods) and report the new ablation results to quantify the isolated effect of the training regime. This revision directly addresses the soundness concern.

revision: yes
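The fixed-planner ablation promised in the rebuttal amounts to putting both models behind one interface and running the same planner with the same budget. The sketch below uses random shooting as a simple stand-in for "standard model-predictive control with fixed horizon". The toy scalar models, the `predict(z0, actions)` signature, and all function names are assumptions for illustration, not the authors' setup.

```python
import random
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float, Sequence[float]], float]


def recursive_predictor(z0: float, actions: Sequence[float]) -> float:
    """Toy one-step model unrolled recursively (plays the LeWM role)."""
    z = z0
    for a in actions:
        z = z + a
    return z


def direct_predictor(z0: float, actions: Sequence[float]) -> float:
    """Toy variable-length model queried once (plays the VLWM role)."""
    return z0 + sum(actions)


def random_shooting_plan(
    predict: Predictor, z0: float, goal: float,
    horizon: int, n_candidates: int, seed: int,
) -> Tuple[List[float], float]:
    """Same planner for every model: sample plans, keep the lowest-cost one."""
    rng = random.Random(seed)  # identical seed gives identical candidate plans
    best_plan: List[float] = []
    best_cost = float("inf")
    for _ in range(n_candidates):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = abs(predict(z0, plan) - goal)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


# Only the predictor differs between the two arms of the ablation.
arms = {"one-step": recursive_predictor, "variable-length": direct_predictor}
results = {
    name: random_shooting_plan(model, z0=0.0, goal=2.0,
                               horizon=5, n_candidates=64, seed=0)
    for name, model in arms.items()
}
```

Because the horizon, candidate budget, and seed are shared, any difference between the arms in a real experiment would be attributable to the model rather than to the planner. In this toy the two predictors are equivalent by construction, so the arms agree.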

Circularity Check

0 steps flagged

No circularity; empirical method with external baseline comparison

The paper proposes VLWMs as a new training framework (variable-length prediction + curriculum) evaluated via experiments against the external LeWM baseline. No derivation chain, equations, or first-principles results are presented that reduce to inputs by construction. Claims rest on reported performance deltas rather than tautological redefinitions or self-citation load-bearing arguments. The experimental design (including planner tailoring) may raise validity questions but does not constitute circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the high-level framework.

Reference graph

Works this paper leans on

21 extracted references · 10 linked inside Pith

[1] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

[2] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024.

[3] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

[4] Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 58(3):1–38, 2025.

[5] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3):440, 2018.

[6] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.

[7] Nick Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In ICLR, 2024.

[8] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. In NeurIPS, 2026.

[9] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026.

[10] Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, and Randall Balestriero. Causal-jepa: Learning world models through object-level latent interventions. arXiv preprint arXiv:2602.11389, 2026.

[11] Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, and Rogerio Feris. Va-red 2: Video adaptive redundancy reduction. arXiv preprint arXiv:2102.07887, 2021.

[12] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. In ICLR, 2025.

[13] Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2004.

[14] Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, and Yann LeCun. Stress-testing offline reward-free reinforcement learning: A case for planning with latent dynamics models. In 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025.

[15] HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025.

[16] Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report. arXiv preprint arXiv:2510.22200, 2025.

[17] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017.

[18] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[19] Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, and Yun Fu. ThinkJEPA: Empowering latent world models with large vision-language reasoning model. arXiv preprint arXiv:2603.22281, 2026.

[20] Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, et al. Hierarchical planning with latent world models. arXiv preprint arXiv:2604.03208, 2026.

[21] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024.

A Environment and Dataset Details

We adopt the same three environments and datasets as LeWM [9], so that the planning protocol, data, and evaluation budgets are directly comparable...
