pith. machine review for the scientific record.

arxiv: 2512.21714 · v2 · submitted 2025-12-25 · 💻 cs.CV

Recognition: unknown

AstraNav-World: World Model for Foresight Control and Consistency

Botao Ren, Chengyu Bai, Fei Liu, Haochen Bai, Jintao Chen, Junjun Hu, Minghua Luo, Mu Xu, Shanghang Zhang, Shichao Xie, Xiaolong Wu, Xinda Xue, Zedong Chu, Ziyi Chen

Authors on Pith: no claims yet
classification 💻 cs.CV
keywords astranav-world, embodied, foresight, model, navigation, real-world, visual, world
original abstract

Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with removal of either branch degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distributions. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.
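The core training idea in the abstract is a bidirectional constraint: a video branch predicts future visual states conditioned on planned actions, while a policy branch derives actions from those predicted futures, and both objectives are optimized jointly rather than in a decoupled "envision-then-plan" pipeline. The paper's code is not part of this listing, so the sketch below is only a minimal PyTorch illustration of such a joint objective; the module names (VideoBranch, PolicyBranch), the feature-space MSE losses, the tensor shapes, and the loss weights are all assumptions for illustration, not the authors' implementation, which uses a diffusion-based video generator and a vision-language policy.

```python
# Minimal, illustrative sketch of a jointly trained "envision + plan" objective.
# All modules, shapes, and losses here are assumptions; AstraNav-World itself
# uses a diffusion video generator and a vision-language policy.
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """Predicts future frame features conditioned on current obs + planned actions."""
    def __init__(self, obs_dim=256, act_dim=4, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * horizon, 512), nn.ReLU(),
            nn.Linear(512, obs_dim * horizon),
        )

    def forward(self, obs_feat, actions):
        # obs_feat: (B, obs_dim); actions: (B, horizon, act_dim)
        x = torch.cat([obs_feat, actions.flatten(1)], dim=-1)
        return self.net(x).view(obs_feat.size(0), self.horizon, -1)  # (B, H, obs_dim)

class PolicyBranch(nn.Module):
    """Derives an action sequence conditioned on (predicted) future visual features."""
    def __init__(self, obs_dim=256, act_dim=4, horizon=8):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim * horizon, 512), nn.ReLU(),
            nn.Linear(512, act_dim * horizon),
        )

    def forward(self, future_feats):
        # future_feats: (B, H, obs_dim)
        out = self.net(future_feats.flatten(1))
        return out.view(-1, self.horizon, self.act_dim)

def joint_loss(video, policy, obs_feat, gt_future_feats, gt_actions, w_vis=1.0, w_act=1.0):
    """Two complementary objectives optimized together:
    1) action-conditioned multi-step visual prediction,
    2) actions derived from the *predicted* visuals (keeps futures executable)."""
    pred_future = video(obs_feat, gt_actions)            # envision, conditioned on actions
    loss_vis = nn.functional.mse_loss(pred_future, gt_future_feats)
    pred_actions = policy(pred_future)                   # plan, conditioned on the prediction
    loss_act = nn.functional.mse_loss(pred_actions, gt_actions)
    return w_vis * loss_vis + w_act * loss_act            # one backward pass couples both branches

if __name__ == "__main__":
    B, H, D, A = 2, 8, 256, 4
    video, policy = VideoBranch(D, A, H), PolicyBranch(D, A, H)
    opt = torch.optim.Adam(list(video.parameters()) + list(policy.parameters()), lr=1e-4)
    obs = torch.randn(B, D)
    gt_future, gt_acts = torch.randn(B, H, D), torch.randn(B, H, A)
    loss = joint_loss(video, policy, obs, gt_future, gt_acts)
    loss.backward()
    opt.step()
    print(f"joint loss: {loss.item():.4f}")
```

The detail this sketch is meant to surface is that the policy loss is computed on the predicted futures rather than ground-truth frames, so its gradient also flows into the video branch; that shared gradient path is what makes the constraint bidirectional and distinguishes unified training from a decoupled pipeline.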

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.

  2. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

AsyncShield restores the geometric intent of latency-delayed VLA commands via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  3. What Limits Vision-and-Language Navigation?

    cs.RO 2026-05 unverdicted novelty 5.0

    StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.

  4. Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

    cs.CV 2026-04 unverdicted novelty 5.0

    ABot-Explorer unifies online exploration and hierarchical semantic memory construction via VLM-distilled navigational affordances for improved embodied navigation efficiency.

  5. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

  6. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...