AR Forcing: Towards Long-Horizon Robot Navigation World Model

Yifei Yang, Zehua Fan, Huan Li, Aoqi Wang, Lida Huang, Haibao Yu, Haiyan Liu, Xuanyao Mao, Jason Bao, Liang Xu, Bingchuan Sun, Yan Wang

arxiv: 2605.31314 · v1 · pith:BYTDEJXGnew · submitted 2026-05-29 · 💻 cs.RO

Pith reviewed 2026-06-28 21:54 UTC · model grok-4.3

classification 💻 cs.RO

keywords AR Forcing · diffusion world models · robot navigation · autoregressive training · distribution shift · long-horizon prediction · trajectory accuracy

The pith

AR Forcing trains diffusion robot navigation models by feeding their own predictions back into the training loop to close the gap with autoregressive inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion-based world models for robot navigation are trained with parallel supervision but must generate sequences autoregressively at test time. This mismatch creates a distribution shift that degrades image quality and trajectory accuracy over long horizons. AR Forcing addresses the shift by running the standard diffusion noise-prediction loss inside an autoregressive loop, so the model repeatedly conditions on its own prior outputs during training. The approach keeps the original diffusion architecture and sampler unchanged and adds no extra networks or losses. Experiments across four navigation datasets show gains in long-horizon image consistency and path accuracy in both familiar and unseen environments.

Core claim

AR Forcing integrates the standard diffusion loss into an autoregressive training loop in which, at each step, the model conditions on its own previously generated frames to predict the noise for the current frame. This procedure exposes the model to the same state distribution it will encounter at inference time, reducing the train-inference discrepancy that otherwise accumulates over long robot trajectories.
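In symbols, the core claim can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $x_k$ is the $k$-th ground-truth frame, $\hat{x}_k$ a frame generated by the model, $m$ the context length, $\bar{\alpha}_t$ the cumulative noise schedule, and $\epsilon_\theta$ the noise-prediction network. Any action or pose conditioning is left implicit, and the noised target is assumed to remain the ground-truth frame.

$$
x_k^{(t)} = \sqrt{\bar{\alpha}_t}\, x_k + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}_{\text{parallel}} = \mathbb{E}_{k,t,\epsilon} \left\| \epsilon - \epsilon_\theta\!\left(x_k^{(t)},\, t,\, x_{k-m:k-1}\right) \right\|^2
$$

$$
\mathcal{L}_{\text{AR}} = \mathbb{E}_{k,t,\epsilon} \left\| \epsilon - \epsilon_\theta\!\left(x_k^{(t)},\, t,\, \hat{x}_{k-m:k-1}\right) \right\|^2
$$

Under this reading, the only difference between the two objectives is the conditioning context: ground-truth frames $x_{k-m:k-1}$ in the parallel case, model-generated frames $\hat{x}_{k-m:k-1}$ in the autoregressive case (with ground-truth frames still present at the start of a rollout, before any predictions exist).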

What carries the argument

AR Forcing: an autoregressive training loop that inserts the model's own predictions into the conditioning context before each single-step diffusion noise prediction, thereby matching the inference distribution without altering the underlying diffusion model or sampler.
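The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoiser` and `toy_sampler` are hypothetical stand-ins for the real network and diffusion sampler, frames are plain vectors, a single fixed noise level `alpha_bar` replaces a sampled timestep, and no gradient step is taken. How gradients flow through the generated context is not specified in the text reviewed here.

```python
import numpy as np


def noise_prediction_loss(denoiser, context, target, rng, alpha_bar=0.5):
    """Single-step diffusion loss: MSE between true and predicted noise.

    Assumption: one fixed noise level stands in for a sampled timestep.
    """
    eps = rng.standard_normal(target.shape)
    noisy = np.sqrt(alpha_bar) * target + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = denoiser(noisy, context)
    return float(np.mean((eps_hat - eps) ** 2))


def ar_forcing_rollout(denoiser, sampler, frames, context_len, rng):
    """Accumulate the diffusion loss while the context fills with predictions."""
    context = [f for f in frames[:context_len]]  # seed with ground-truth frames
    losses = []
    for target in frames[context_len:]:
        ctx = np.stack(context[-context_len:])
        # Same loss as in parallel training, but conditioned on `ctx`,
        # which may already contain model-generated frames.
        losses.append(noise_prediction_loss(denoiser, ctx, target, rng))
        # Generate the next frame with the model and feed it back
        # instead of appending the ground-truth frame.
        context.append(sampler(denoiser, ctx, rng))
    return float(np.mean(losses)), np.stack(context)


# Hypothetical toy stand-ins, not the paper's network or sampler.
def toy_denoiser(noisy, context):
    return noisy - context.mean(axis=0)


def toy_sampler(denoiser, context, rng):
    return context[-1] + 0.1 * rng.standard_normal(context[-1].shape)


rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))  # 8 frames, each a 4-dim vector
loss, rollout = ar_forcing_rollout(toy_denoiser, toy_sampler, frames, 2, rng)
```

Replacing `context.append(sampler(...))` with `context.append(target)` would recover conditioning on ground-truth frames only, which is the behaviour the method is meant to move away from.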

If this is right

Long-horizon generated images remain more consistent because the model has been optimized under its own prediction distribution.
Predicted trajectories achieve higher accuracy on the RECON, SCAND, HuRoN, and TartanDrive datasets relative to parallel-supervision baselines.
The model exhibits greater robustness when navigating both known and previously unseen environments without requiring new loss terms or auxiliary networks.
The original diffusion training objective and sampling procedure are retained, allowing direct integration into existing diffusion world-model pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-prediction loop could be applied to other sequential diffusion tasks such as video prediction where train-test mismatch also arises.
Because no extra components are introduced, AR Forcing can be combined with future improvements to the base diffusion architecture without redesign.
If single-step accuracy remains unchanged while long-horizon performance rises, that would indicate that distribution alignment alone can suffice for stability gains.

Load-bearing premise

The distribution shift caused by training with parallel supervision but testing autoregressively is the primary reason long-horizon predictions become unstable.
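A toy calculation shows why conditioning on one's own outputs matters. It illustrates only how a small one-step error compounds once a model consumes its own predictions; it says nothing about whether this is the dominant failure mode in the paper. The scalar dynamics and the values `a_true` and `a_model` are invented for the example.

```python
import numpy as np


def one_step_vs_rollout_error(a_true=1.0, a_model=1.05, x0=1.0, steps=20):
    """Compare teacher-forced and autoregressive error for x_{k+1} = a * x_k."""
    truth = x0 * a_true ** np.arange(steps + 1)
    # Teacher-forced: every prediction starts from the true previous state.
    teacher_forced = a_model * truth[:-1]
    # Autoregressive: every prediction starts from the model's previous output.
    rollout = x0 * a_model ** np.arange(1, steps + 1)
    tf_err = np.abs(teacher_forced - truth[1:])
    ar_err = np.abs(rollout - truth[1:])
    return tf_err, ar_err


tf_err, ar_err = one_step_vs_rollout_error()
```

In this toy setting the teacher-forced error stays at the one-step level for every step, while the autoregressive error grows with the horizon, even though both use the same model.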

What would settle it

Train identical diffusion navigation models with and without AR Forcing on the same multi-domain dataset, then compare image consistency metrics and trajectory error after 20 or more autoregressive steps. Either no long-horizon improvement or a drop in single-step accuracy would count against the claim.
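The trajectory half of that comparison could be organised as below. This is a sketch on synthetic data: the function names, the array layout `(rollouts, steps, dim)`, and the drift values are assumptions made for the example, and image consistency metrics are left out.

```python
import numpy as np


def per_step_error(pred, truth):
    """Mean L2 error at each rollout step; inputs are (rollouts, steps, dim)."""
    return np.linalg.norm(pred - truth, axis=-1).mean(axis=0)


def compare_rollouts(err_baseline, err_candidate, horizon=20):
    """Negative deltas mean the candidate has lower error than the baseline."""
    return {
        "single_step_delta": float(err_candidate[0] - err_baseline[0]),
        "long_horizon_delta": float(
            err_candidate[horizon - 1:].mean() - err_baseline[horizon - 1:].mean()
        ),
    }


# Synthetic stand-in data: the baseline drifts twice as fast as the candidate.
rng = np.random.default_rng(0)
truth = np.zeros((16, 24, 2))
drift = np.linspace(0.0, 1.0, 24)[None, :, None]
baseline_pred = truth + drift + 0.01 * rng.standard_normal(truth.shape)
candidate_pred = truth + 0.5 * drift + 0.01 * rng.standard_normal(truth.shape)

summary = compare_rollouts(
    per_step_error(baseline_pred, truth),
    per_step_error(candidate_pred, truth),
)
```

Reporting both deltas side by side keeps the two outcomes the test cares about separate: a long-horizon gain and any change at the first step.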

Figures

Figures reproduced from arXiv: 2605.31314 by Aoqi Wang, Bingchuan Sun, Haibao Yu, Haiyan Liu, Huan Li, Jason Bao, Liang Xu, Lida Huang, Xuanyao Mao, Yan Wang, Yifei Yang, Zehua Fan.

**Figure 2.** Architecture of AR Forcing. In AR Forcing (ours), the model predicts the next frame, feeds its own prediction back into the context, and applies the same diffusion loss at each step. This aligns training with inference rollouts without extra losses or architectural changes, improving long-horizon stability and planning robustness. NWM trains diffusion world models with parallel supervision: given a fixed…

**Figure 3.** Qualitative comparison of predicted trajectories.

**Figure 4.** Visual Qualitative Comparisons on Unknown Environment Go Stan

**Figure 5.** Long-horizon Prediction Stability on Unknown Environment Go

**Figure 1.** Qualitative 16-second rollouts on RECON.

**Figure 2.** Qualitative 16-second rollouts on TartanDrive.

**Figure 3.** Qualitative 16-second rollouts on SCAND.

**Figure 4.** Qualitative 16-second rollouts on HuRoN.

**Figure 5.** Zero-shot qualitative results on the unknown environment Go Stan

Abstract

Diffusion-based robot navigation world models are typically trained with parallel supervision, while autoregressive inference is employed during path planning. This results in a distribution shift between training and inference, which destabilizes performance over long-horizon prediction. We propose AR Forcing, an autoregressive training strategy that integrates the standard diffusion loss into the autoregressive training loop. At each step, the model uses its own predictions to update the context and optimizes the single-step noise prediction objective, thereby explicitly exposing the model to the inference state distribution during training. Our method does not require additional discriminators or distribution-matching losses, retains the original diffusion framework and sampler, and is easy to integrate. Experiments on multi-domain navigation datasets (RECON, SCAND, HuRoN, TartanDrive) show that, compared with strong baselines, AR Forcing improves the consistency of generated images during long-horizon navigation and the accuracy of predicted trajectories, enhancing the robustness of the model in complex known and unknown environments. We will release the code soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AR Forcing is a simple training-loop change that feeds a diffusion model's own predictions back as context to reduce the train-inference gap in long-horizon navigation, with reported gains on four datasets but limited detail on the size of the effect.

The paper's core move is to replace parallel supervision with an autoregressive training loop for diffusion-based robot navigation world models. At each step the model conditions on its own previous outputs when predicting the next noise, so the training distribution matches what happens at inference. No new loss terms or samplers are added, which keeps the change lightweight and easy to drop into existing code.

This is the part that is actually new: the explicit integration of self-prediction into the diffusion objective for this domain. The experiments run on RECON, SCAND, HuRoN, and TartanDrive and claim better long-horizon image consistency plus more accurate trajectories in both familiar and unseen settings. That matches the practical bottleneck the abstract identifies.

The main limitation is that the description stays high-level. Without the actual numbers, ablations, or implementation pseudocode, it is difficult to judge how large the improvement is or whether it comes at the cost of single-step accuracy. If the full paper shows clear margins over strong baselines and checks for compounding errors, the result holds; otherwise the claim rests on the abstract alone.

The work is aimed at robotics groups already using diffusion world models for navigation. Anyone trying to make autoregressive rollouts more stable will find the idea worth testing. It is worth sending to peer review so the experiments can be examined in detail.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes AR Forcing, an autoregressive training strategy for diffusion-based robot navigation world models. It integrates the standard diffusion loss into the autoregressive training loop by using the model's own predictions to update the context at each step, thereby exposing the model to the inference distribution. The approach is presented as a simple modification that requires no additional discriminators or losses and retains the original sampler. Experiments on the RECON, SCAND, HuRoN, and TartanDrive datasets are claimed to show improved long-horizon image consistency and trajectory accuracy relative to strong baselines in both known and unknown environments.

Significance. If the reported gains are reproducible and the method generalizes, AR Forcing would offer a lightweight way to mitigate exposure bias in diffusion world models for long-horizon robot navigation tasks, without altering the core diffusion framework or introducing new training objectives.

major comments (2)

[Abstract] The provided manuscript text consists solely of the abstract, which states the method and claims empirical gains but supplies no equations, quantitative metrics, ablation results, or implementation details; the central claim cannot be verified from the given text alone.
[Abstract] The core premise that distribution shift is the dominant source of long-horizon instability and that feeding self-predictions back during training will close this gap without introducing compensating instabilities is stated but not supported by any derivation or analysis in the visible text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review. The comments appear to reference only the abstract; the full manuscript contains dedicated sections with equations, quantitative results, ablations, and analysis. We respond point-by-point below.

Referee: [Abstract] The provided manuscript text consists solely of the abstract, which states the method and claims empirical gains but supplies no equations, quantitative metrics, ablation results, or implementation details; the central claim cannot be verified from the given text alone.

Authors: The full manuscript (Sections 3–5 and appendix) supplies the requested elements: the AR Forcing objective is formalized with equations integrating the diffusion noise-prediction loss into the autoregressive loop; Tables 1–3 report quantitative metrics (image consistency via PSNR/SSIM/LPIPS, trajectory accuracy via ADE/FDE) on RECON, SCAND, HuRoN, and TartanDrive; ablation studies isolate the effect of self-prediction feedback; and implementation details (training schedule, context length, sampler) are provided. The central empirical claims are therefore verifiable from the complete document.

revision: no

Referee: [Abstract] The core premise that distribution shift is the dominant source of long-horizon instability and that feeding self-predictions back during training will close this gap without introducing compensating instabilities is stated but not supported by any derivation or analysis in the visible text.

Authors: Section 2 motivates the distribution shift between parallel training and autoregressive inference with reference to exposure bias literature. Section 3 derives the training procedure that explicitly conditions on model-generated context at each step. Section 4 supplies empirical analysis: long-horizon rollouts show consistent gains in image fidelity and trajectory accuracy over strong baselines without degradation in short-horizon performance or introduction of new failure modes, supporting that the feedback loop does not create compensating instabilities on the evaluated navigation datasets.

revision: no
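The simulated rebuttal names ADE/FDE as trajectory metrics. Assuming the usual definitions (average and final displacement error over a predicted path), they can be computed as in the sketch below. The exact metric definitions used in the paper are not visible in the text reviewed here, and the function name and sample points are invented for the example.

```python
import numpy as np


def ade_fde(pred, truth):
    """Average and final displacement error for (steps, 2) position arrays."""
    dist = np.linalg.norm(pred - truth, axis=-1)  # per-step Euclidean distance
    return float(dist.mean()), float(dist[-1])


# Made-up example: the prediction drifts sideways by one unit per step.
truth = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
ade, fde = ade_fde(pred, truth)  # per-step distances are 0, 1, 2
```

ADE averages the error over the whole path, while FDE looks only at the endpoint, so a rollout that drifts late shows up more strongly in FDE.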

Circularity Check

0 steps flagged

No significant circularity

The paper presents AR Forcing as a direct modification to the training loop of an existing diffusion model, where self-generated predictions are fed back as context while retaining the standard single-step noise prediction loss. No equations, fitted parameters, or derived quantities are shown that reduce the claimed long-horizon consistency gains to a definitional equivalence with the inputs. The central premise (distribution shift between parallel training and autoregressive inference) is addressed by an explicit procedural change whose effect is evaluated empirically on external datasets, without load-bearing self-citations or ansatzes that collapse the argument. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, new entities, or non-standard axioms are introduced beyond the standard diffusion noise-prediction objective and the domain assumption of autoregressive inference use.

axioms (2)

standard math: Diffusion models are trained via a single-step noise-prediction loss.
Invoked when the method retains the original diffusion loss.
domain assumption: Path planning employs autoregressive inference on generated frames.
Stated as the source of the train-inference mismatch.

pith-pipeline@v0.9.1-grok · 5741 in / 1173 out tokens · 27905 ms · 2026-06-28T21:54:23.645335+00:00 · methodology


[1] [1]

Advances in Neural Information Processing Systems37, 58757–58791 (2024)

Alonso,E.,Jelley,A.,Micheli,V.,Kanervisto,A.,Storkey,A.J.,Pearce,T.,Fleuret, F.: Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems37, 58757–58791 (2024)

2024

[2] [2]

In: Proceedings of the 31st International Conference on Computational Linguistics

Asada, M., Miwa, M.: Addressing the training-inference discrepancy in discrete diffusion for text generation. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 7156–7164 (2025)

2025

[3] [3]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15791–15801 (2025)

2025

[4] [4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

[6] Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: Forty-first International Conference on Machine Learning (2024)

[7] Chen, B., Martí Monsó, D., Du, Y., Simchowitz, M., Tedrake, R., Sitzmann, V.: Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 37, 24081–24125 (2024)

[8] Chen, C., Deng, F., Kawaguchi, K., Gulcehre, C., Ahn, S.: Simple hierarchical planning with diffusion. arXiv preprint arXiv:2401.02644 (2024)

[9] Chen, R., Shi, M., Huang, S., Tan, P., Komura, T., Chen, X.: Taming diffusion probabilistic models for character control. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–10 (2024)

[10] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)

[11] Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)

[12] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)

[13] Geng, Z., Pokle, A., Kolter, J.Z.: One-step diffusion distillation via deep equilibrium models. Advances in Neural Information Processing Systems 36, 41914–41931 (2023)

[14] Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018)

[15] Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 (2023)

[16] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)

[17] Hirose, N., Shah, D., Sridhar, A., Levine, S.: Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters 9(1), 49–56 (2023)

[18] Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023)

[19] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

[20] Karnan, H., Nair, A., Xiao, X., Warnell, G., Pirk, S., Toshev, A., Hart, J., Biswas, J., Stone, P.: Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters 7(4), 11807–11814 (2022)

[21] Lee, K., Choi, J.: Local manifold approximation and projection for manifold-aware diffusion planning. arXiv preprint arXiv:2506.00867 (2025)

[22] Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)

[23] Liu, Y., Chen, W., Bai, Y., Liang, X., Li, G., Gao, W., Lin, L.: Aligning cyber space with physical world: A comprehensive survey on embodied ai. IEEE/ASME Transactions on Mechatronics (2025)

[24] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

[25] Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems 36, 76525–76546 (2023)

[26] Mazé, F., Ahmed, F.: Diffusion models beat gans on topology optimization. In: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence. pp. 9108–9116 (2023)

[27] Micheli, V., Alonso, E., Fleuret, F.: Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588 (2022)

[28] Ni, F., Hao, J., Mu, Y., Yuan, Y., Zheng, Y., Wang, B., Liang, Z.: Metadiffuser: Diffusion model as conditional planner for offline meta-rl. In: International Conference on Machine Learning. pp. 26087–26105. PMLR (2023)

[29] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: International conference on machine learning. pp. 8162–8171. PMLR (2021)

[30] Ning, M., Li, M., Su, J., Salah, A.A., Ertugrul, I.O.: Elucidating the exposure bias in diffusion models. arXiv preprint arXiv:2308.15321 (2023)

[31] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

[32] Ruhe, D., Heek, J., Salimans, T., Hoogeboom, E.: Rolling diffusion models. arXiv preprint arXiv:2402.09470 (2024)

[33] Sahoo, S.S., Deschenaux, J., Gokaslan, A., Wang, G., Chiu, J., Kuleshov, V.: The diffusion duality. arXiv preprint arXiv:2506.10892 (2025)

[34] Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)

[35] Shah, D., Eysenbach, B., Kahn, G., Rhinehart, N., Levine, S.: Rapid exploration for open-world navigation with latent goal models. arXiv preprint arXiv:2104.05859 (2021)

[36] Shah, D., Sridhar, A., Bhorkar, A., Hirose, N., Levine, S.: Gnm: A general navigation model to drive any robot. arXiv preprint arXiv:2210.03370 (2022)

[37] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: International Conference on Machine Learning. pp. 32211–32252. PMLR (2023)

[38] Sridhar, A., Shah, D., Glossop, C., Levine, S.: Nomad: Goal masked diffusion policies for navigation and exploration. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 63–70. IEEE (2024)

[39] Sturm, J., Burgard, W., Cremers, D.: Evaluating egomotion and structure-from-motion approaches using the tum rgb-d benchmark. In: Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robot Systems (IROS). vol. 13, p. 6 (2012)

[40] Sun, J., Jiang, Y., Qiu, J., Nobel, P., Kochenderfer, M.J., Schwager, M.: Conformal prediction for uncertainty-aware planning with diffusion dynamics model. Advances in Neural Information Processing Systems 36, 80324–80337 (2023)

[41] Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37, 84839–84865 (2024)

[42] Triest, S., Sivaprakasam, M., Wang, S.J., Wang, W., Johnson, A.M., Scherer, S.: Tartandrive: A large-scale dataset for learning off-road dynamics models. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2546–2552. IEEE (2022)

[43] Valevski, D., Leviathan, Y., Arar, M., Fruchter, S.: Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837 (2024)

[44] Venkatraman, S., Khaitan, S., Akella, R.T., Dolan, J., Schneider, J., Berseth, G.: Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599 (2023)

[45] Wang, Y., He, J., Fan, L., Li, H., Chen, Y., Zhang, Z.: Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14749–14759 (2024)

[46] Wu, P., Escontrela, A., Hafner, D., Abbeel, P., Goldberg, K.: Daydreamer: World models for physical robot learning. In: Conference on robot learning. pp. 2226–

[47] Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 1(2), 6 (2023)

[48] Yang, Z., Chen, Y., Wang, J., Manivasagam, S., Ma, W.C., Yang, A.J., Urtasun, R.: Unisim: A neural closed-loop sensor simulator. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1389–1399 (2023)

[49] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems 37, 47455–47487 (2024)

[50] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6613–6623 (2024)

[51] Yu, S., Nie, W., Huang, D.A., Li, B., Shin, J., Anandkumar, A.: Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148 (2024)

[52] Yu, S., Sohn, K., Kim, S., Shin, J.: Video probabilistic diffusion models in projected latent space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18456–18466 (2023)

[53] Zhang, J., Liu, D., Park, E., Zhang, S., Xu, C.: Anti-exposure bias in diffusion models. In: 13th International Conference on Learning Representations, ICLR 2025. pp. 59458–59478. International Conference on Learning Representations, ICLR (2025)

[54] Zhang, L., Xiong, Y., Yang, Z., Casas, S., Hu, R., Urtasun, R.: Copilot4d: Learning unsupervised world models for autonomous driving via discrete diffusion. arXiv preprint arXiv:2311.01017 (2023)

[55] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)

[56] Zhang, W., Wang, G., Sun, J., Yuan, Y., Huang, G.: Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems 36, 27147–27166 (2023)

[57] Zhou, M., Zheng, H., Wang, Z., Yin, M., Huang, H.: Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In: Forty-first International Conference on Machine Learning (2024)