pith. machine review for the scientific record.

arxiv: 1904.04195 · v1 · submitted 2019-04-08 · 💻 cs.CL · cs.CV · cs.LG

Recognition: unknown

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

Authors on Pith: no claims yet
classification 💻 cs.CL · cs.CV · cs.LG
keywords unseen environments · agent · environment · learning · navigate · triplets · approaches
original abstract

A grand goal in AI is to build a robot that can accurately navigate based on natural language instructions, which requires the agent to perceive the scene, understand and ground language, and act in the real-world environment. One key challenge here is to learn to navigate in new environments that are unseen during training. Most of the existing approaches perform dramatically worse in unseen environments as compared to seen ones. In this paper, we present a generalizable navigational agent. Our agent is trained in two stages. The first stage is training via mixed imitation and reinforcement learning, combining the benefits from both off-policy and on-policy optimization. The second stage is fine-tuning via newly-introduced 'unseen' triplets (environment, path, instruction). To generate these unseen triplets, we propose a simple but effective 'environmental dropout' method to mimic unseen environments, which overcomes the problem of limited seen environment variability. Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions. Empirically, we show that our agent is substantially better at generalizability when fine-tuned with these triplets, outperforming the state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task, and achieving the top rank on the leaderboard.
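The key idea behind 'environmental dropout' is that, unlike standard dropout, the mask over visual feature channels is sampled once per environment and shared across every viewpoint, so the agent effectively sees a consistent "new" environment rather than random per-step noise. Here is a minimal NumPy sketch of that idea; the function name, tensor shapes, and inverted-dropout scaling are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def environmental_dropout(env_features, drop_prob=0.5, seed=None):
    """Mimic an 'unseen' environment by zeroing the SAME feature
    channels at every viewpoint of a seen environment.

    env_features: (num_viewpoints, num_views, feat_dim) array of
    precomputed visual features for one environment.
    """
    rng = np.random.default_rng(seed)
    feat_dim = env_features.shape[-1]
    # One channel mask per environment, broadcast over all
    # viewpoints and panoramic views (the crucial difference from
    # per-element dropout, which would break cross-view consistency).
    mask = (rng.random(feat_dim) >= drop_prob).astype(env_features.dtype)
    # Inverted-dropout scaling keeps the expected feature magnitude.
    return env_features * mask / (1.0 - drop_prob)

# Hypothetical example: 4 viewpoints, 36 panoramic views, 2048-d features.
feats = np.ones((4, 36, 2048), dtype=np.float32)
dropped = environmental_dropout(feats, drop_prob=0.5, seed=0)
# Every viewpoint shares the same zeroed channels:
assert np.array_equal(dropped[0] != 0, dropped[3] != 0)
```

Back-translation would then run a speaker model over paths sampled in these dropped-out environments to generate the new (environment, path, instruction) triplets for fine-tuning.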

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

    cs.CV 2026-03 conditional novelty 6.0

    TagaVLM embeds topological structures into VLMs via residual attention and interleaved prompts, achieving 51.09% success rate on R2R unseen environments and outperforming prior large-model methods.

  2. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

  3. Think before Go: Hierarchical Reasoning for Image-goal Navigation

    cs.RO 2026-04 unverdicted novelty 5.0

    HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.