
arxiv: 2606.22424 · v2 · pith:6EVXMX4Enew · submitted 2026-06-21 · 💻 cs.CV

FlowDec: Temporal Conditional Flow Decorruptor for Robust Continuous Vision-Language Navigation

Pith reviewed 2026-06-26 10:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords FlowDec · vision-language navigation · image restoration · generative flow · temporal conditioning · visual corruptions · embodied navigation · continuous environments

The pith

FlowDec restores corrupted images for vision-language navigation agents by conditioning generative flows on temporal history and filtering outputs via action centroids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlowDec, a specialized image restoration approach for agents that must follow language instructions while moving through continuous physical spaces. It combines a hybrid temporal conditioning step that steers a generative flow model using prior observations with an action-centroid guided filter that evaluates and combines restored frames. The goal is to counteract the sharp drop in large-model performance caused by real-world visual corruptions such as noise or blur. Experiments reported in the paper indicate that this combination yields both higher navigation success rates and faster generation times than prior decorruption techniques.
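The action-centroid idea can be made concrete with a small sketch. According to the captions of Figures 2 and 3, centroids are built from ground-truth image pairs and capture the expected latent differential per atomic action, and an auxiliary condition is invoked only when the distance to the corresponding centroid exceeds a threshold. The code below is an illustration under stated assumptions, not the authors' implementation: the function names, the flat-vector latents, and the Euclidean distance are all hypothetical choices.

```python
import numpy as np

def build_action_centroids(latent_pairs):
    """Average the latent differential (z_next - z_prev) per atomic action.

    latent_pairs: iterable of (action, z_prev, z_next) taken from clean frames.
    Flat vector latents are an assumption made for this sketch.
    """
    sums, counts = {}, {}
    for action, z_prev, z_next in latent_pairs:
        diff = np.asarray(z_next, dtype=float) - np.asarray(z_prev, dtype=float)
        sums[action] = sums.get(action, 0.0) + diff
        counts[action] = counts.get(action, 0) + 1
    return {action: sums[action] / counts[action] for action in sums}

def needs_auxiliary_condition(z_prev, z_restored, action, centroids, threshold):
    """Return True when the restored frame's differential is far from the centroid.

    Euclidean distance is an assumed metric; the summary only states that a
    distance is compared against a threshold.
    """
    diff = np.asarray(z_restored, dtype=float) - np.asarray(z_prev, dtype=float)
    distance = float(np.linalg.norm(diff - centroids[action]))
    return distance > threshold

# Toy usage with two clean "forward" transitions in a 2-D latent space.
pairs = [("forward", [0.0, 0.0], [1.0, 0.0]),
         ("forward", [1.0, 0.0], [2.0, 2.0])]
centroids = build_action_centroids(pairs)
consistent = needs_auxiliary_condition([0.0, 0.0], [1.0, 1.0], "forward", centroids, 0.5)
outlier = needs_auxiliary_condition([0.0, 0.0], [4.0, 5.0], "forward", centroids, 0.5)
```

In this toy run the first restored latent matches the centroid exactly and passes the gate, while the second deviates and would trigger the auxiliary condition.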

Core claim

FlowDec integrates a hybrid temporal conditioning strategy to align the generative flow path with historical context and employs action-centroid guided filtering to dynamically assess and integrate outputs, resulting in superior performance in navigation accuracy and generation latency for VLN-CE tasks under visual corruptions.

What carries the argument

Temporal Conditional Flow Decorruptor (FlowDec), a generative flow framework that aligns restoration with historical context through hybrid conditioning and integrates outputs via action-centroid guided filtering.
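For readers unfamiliar with generative flows, sampling amounts to integrating a velocity field from a starting latent at t = 0 to a restored latent at t = 1, with the condition passed in at every step. The sketch below uses plain Euler integration and a hand-written toy velocity field in place of a learned network. The names `euler_flow_sample` and `toy_velocity`, the step count, and the choice of starting point are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def euler_flow_sample(x0, condition, velocity_fn, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t, condition) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        x = x + dt * velocity_fn(x, t, condition)
    return x

def toy_velocity(x, t, condition):
    """Stand-in for a learned network: points straight at the condition vector.

    A real model would predict the velocity from the noisy latent, the time,
    and the conditioning signal (for example, encoded history frames).
    """
    return (np.asarray(condition, dtype=float) - x) / (1.0 - t)

restored = euler_flow_sample([0.0, 0.0], [1.0, 2.0], toy_velocity, num_steps=4)
```

Because the toy field follows a straight path, a handful of Euler steps lands on the target; this is the intuition behind the low step counts that flow-based samplers can afford, though the latency the paper reports depends on its actual model.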

Load-bearing premise

The hybrid temporal conditioning strategy aligns the generative flow path with historical context and the action-centroid guided filtering reliably integrates outputs.

What would settle it

Applying FlowDec to a new set of visual corruption types or a different large model backbone and measuring no gain in navigation accuracy or an increase in latency would falsify the performance claim.
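Any such replication needs a fixed navigation metric. The paper's ablation figure reports SPL, so a check would most likely compare SPL with and without the decorruptor. The sketch below assumes the standard definition from Anderson et al. [1], where each episode contributes success times shortest-path length over the larger of the taken and shortest path lengths. The function name and the sample numbers are illustrative only and are not results from the paper.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 0/1 flags per episode.
    shortest_lengths: geodesic start-to-goal distances (assumed positive).
    path_lengths: lengths of the paths the agent actually took.
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        total += success * shortest / max(taken, shortest)
    return total / len(successes)

# Three made-up episodes: optimal success, failure, success with a 2x detour.
score = spl([1, 0, 1], [10.0, 5.0, 8.0], [10.0, 7.0, 16.0])
```

With these made-up episodes the per-episode terms are 1.0, 0.0, and 0.5, so the average is 0.5.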

Figures

Figures reproduced from arXiv: 2606.22424 by Changhao Chen, Yufei Zhang.

Figure 1
LM-based VLN models suffer from corruption despite their strong generalization ability, while FlowDec can effectively alleviate the influence from corruption. The experiment is conducted on R2R-CE dataset using NaVid [48] as backbone.
…12, 30, 46–48]. By leveraging their strong reasoning and cross-modal alignment capabilities [4,22,54], LM-based agents such as NaVid [48] demonstrate competitive long-horiz…
Figure 2
Overview of training phase. (A) Model utilizes three types of conditions for training. The proportion of c1 remains constant, while c3 gradually replaces c2 as the number of epochs increases. (B) Ground-truth image pairs are used to construct action centroids, which capture expected latent differentials per atomic action.
3.2 Temporal Conditional Flow Matching Learning Strategy. Robust viewpoint recovery in…
Figure 3
Overview of the inference phase. The model primarily generates using condition c1. The auxiliary condition c3 is invoked only when the distance to the corresponding action centroid exceeds a threshold wθ.
…where a denotes the type of atomic action. This centroid captures the expected latent differences associated with specific actions, facilitating consistent image reconstruction in continuous navigation ta…
Figure 4
Illustration of recovered images. (A) Recovery performance of different models under Gaussian noise, Snow, Contrast, and JPEG compression. (B) Recovery performance of different models for the same trajectory under Shot noise. Note in particular that for the same wall (marked within the frame), only our method yields results with high visual consistency.
…effective generalization. In contrast, weather effec…
Figure 5
Illustration of experiments in real-world environment. The two rows of images on the right side represent the original image and the decorrupted image respectively.
…typically comprising ∼70 steps, the cumulative overhead of diffusion baselines renders them infeasible for real-time operation. FlowDec, uniquely, reconciles stringent robustness requirements with low-latency constraints, establishing it as a p…
Figure 6
Summary of ablation study results. All results are measured by SPL.
Figure 7
Visualization of 12 corruption types on the sampled image in R2R-CE benchmark.
B Benchmarks, Backbone and Training Details. B.1 Benchmark. All experiments are conducted on two prominent VLN-CE datasets: R2R-CE [24] and RxR-CE [25]. To assess robust navigation under visual degradations, we introduce corruptions to the visual modality. Following [34,43], we apply 12 corruption types for comprehensive evaluat…
Figure 8
Visualization of action-centroid guided filtering.
Figure 9
Q-Q plots of differential latents for each action type
Figure 10
Visualization of images denoised with different steps, using fog as the corruption type
Original abstract

Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions in unseen scenes. While Large Models (LMs) have advanced VLN-CE, their performance remains severely degraded by real-world visual corruptions, a critical yet underexplored domain constraint. We introduce Temporal Conditional Flow Decorruptor (FlowDec), a novel image restoration framework tailored for LM-based VLN-CE. FlowDec integrates a hybrid temporal conditioning strategy to align the generative flow path with historical context and employs action-centroid guided filtering to dynamically assess and integrate outputs. Extensive experiments demonstrate that FlowDec outperforms state-of-the-art decorruption methods in both navigation accuracy and generation latency. Our approach establishes a robust, efficient paradigm for resilient embodied navigation in unpredictable real-world conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces FlowDec, a temporal conditional flow decorruptor for robust Vision-and-Language Navigation in Continuous Environments (VLN-CE). It proposes a hybrid temporal conditioning strategy to align generative flow paths with historical context and an action-centroid guided filtering mechanism to assess and integrate outputs. The central claim is that extensive experiments show FlowDec outperforms state-of-the-art decorruption methods in navigation accuracy and generation latency.

Significance. If the performance claims hold with proper validation, the work would address a practical gap in deploying large-model VLN-CE agents under real-world visual corruptions, potentially offering an efficient flow-based restoration paradigm that integrates temporal history and action guidance.

major comments (1)
  1. [Abstract] The claim that 'extensive experiments demonstrate that FlowDec outperforms state-of-the-art decorruption methods in both navigation accuracy and generation latency' is load-bearing for the paper's primary empirical contribution, yet it is unsupported by any data, baselines, tables, figures, error bars, or method details.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the review and the opportunity to clarify. We respond point-by-point to the major comment below.

  1. Referee: [Abstract] The claim that 'extensive experiments demonstrate that FlowDec outperforms state-of-the-art decorruption methods in both navigation accuracy and generation latency' is load-bearing for the paper's primary empirical contribution, yet it is unsupported by any data, baselines, tables, figures, error bars, or method details.

    Authors: We agree that the abstract's empirical claim requires explicit supporting evidence in the manuscript. The current version provides only the abstract and does not include the referenced data, baselines, tables, figures, error bars, or method details. We will revise the abstract to remove or qualify the unsupported claim and add the necessary experimental sections with proper validation in the next version.
    Revision: yes

Circularity Check

0 steps flagged

No circularity detected; abstract contains no equations, derivations, or self-referential claims


The provided abstract and reader summary contain no equations, parameter-fitting steps, self-citations, or uniqueness theorems. Claims of outperformance are empirical assertions without any derivation chain that could reduce to inputs by construction. No load-bearing steps match any enumerated circularity pattern, so the paper's central claims cannot be shown to be circular from the given text. This is the expected outcome when no technical derivation is visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training procedures or modeling choices, so no free parameters, axioms or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5659 in / 1007 out tokens · 38740 ms · 2026-06-26T10:22:40.418453+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 24 canonical work pages · 9 internal anchors

  1. [1] Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
  2. [2] Anderson, P., Shrivastava, A., Truong, J., Majumdar, A., Parikh, D., Batra, D., Lee, S.: Sim-to-real transfer for vision-and-language navigation. In: Conference on Robot Learning. pp. 671–681. PMLR (2021)
  3. [3] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., Van Den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3674–3683 (2018)
  4. [4] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
  5. [5] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023)
  6. [6] Chen, J., Lin, B., Xu, R., Chai, Z., Liang, X., Wong, K.Y.K.: Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. arXiv preprint arXiv:2401.07314 (2024)
  7. [7] Chen, J., Chen, J., Chao, H., Yang, M.: Image blind denoising with generative adversarial network based noise modeling. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3155–3164 (2018)
  8. [8] Chen, S., Guhur, P.L., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16537–16547 (2022)
  9. [9] Chen, X., He, K.: Exploring simple siamese representation learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2021). https://doi.org/10.1109/cvpr46437.2021.01549, http://dx.doi.org/10.1109/cvpr46437.2021.01549
  10. [10] Cheng, A.C., Ji, Y., Yang, Z., Gongye, Z., Zou, X., Kautz, J., Bıyık, E., Yin, H., Liu, S., Wang, X.: Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453 (2024)
  11. [11] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org 2(3), 6 (2023)
  12. [12] Contributors, I.: InternNav: InternRobotics’ open platform for building generalized navigation foundation models. https://github.com/InternRobotics/InternNav (2025)
  13. [13] Dao, Q., Phung, H., Nguyen, B., Tran, A.: Flow matching in latent space. arXiv preprint arXiv:2307.08698 (2023)
  14. [14] Gao, J., Zhang, J., Liu, X., Darrell, T., Shelhamer, E., Wang, D.: Back to the source: Diffusion-driven adaptation to test-time corruption. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11786–11796 (2023)
  15. [15] Gao, J., Yao, X., Xu, C.: Fast-slow test-time adaptation for online vision-and-language navigation. arXiv preprint arXiv:2311.13209 (2023)
  16. [16] Gode, S., Nayak, A., Burgard, W.: Flownav: Learning efficient navigation policies via conditional flow matching. In: 2nd CoRL Workshop on Learning Effective Abstractions for Planning (2024)
  17. [17] Guo, J., Zhao, J., Du, C., Wang, Y., Ge, C., Ni, Z., Song, S., Shi, H., Huang, G.: Everything to the synthetic: Diffusion-driven test-time adaptation via synthetic-domain alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 30503–30513 (2025)
  18. [18] Guo, S., Yan, Z., Zhang, K., Zuo, W., Zhang, L.: Toward convolutional blind denoising of real photographs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1712–1722 (2019)
  19. [19] Hendrycks, D., Zou, A., Mazeika, M., Tang, L., Li, B., Song, D., Steinhardt, J.: Pixmix: Dreamlike pictures comprehensively improve safety measures. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2022). https://doi.org/10.1109/cvpr52688.2022.01628, http://dx.doi.org/10.1109/cvpr52688.2022.01628
  20. [20] Hong, H., Qiao, Y., Wang, S., Liu, J., Wu, Q.: General scene adaptation for vision-and-language navigation. arXiv preprint arXiv:2501.17403 (2025)
  21. [21] Kamath, A., Anderson, P., Wang, S., Koh, J.Y., Ku, A., Waters, A., Yang, Y., Baldridge, J., Parekh, Z.: A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10813–10823 (2023)
  22. [22] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)
  23. [23] Kingma, D., Welling, M.: Auto-encoding variational bayes. arXiv: Machine Learning, arXiv: Machine Learning (Dec 2013)
  24. [24] Krantz, J., Wijmans, E., Majumdar, A., Batra, D., Lee, S.: Beyond the nav-graph: Vision-and-language navigation in continuous environments. In: European Conference on Computer Vision. pp. 104–120. Springer (2020)
  25. [25] Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954 (2020)
  26. [26] Liang, J., He, R., Tan, T.: A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision 133(1), 31–64 (2025)
  27. [27] Lin, K., Chen, P., Huang, D., Li, T.H., Tan, M., Gan, C.: Learning vision-and-language navigation from youtube videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8317–8326 (2023)
  28. [28] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
  29. [29] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
  30. [30] Long, Y., Cai, W., Wang, H., Zhan, G., Dong, H.: Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882 (2024)
  31. [31] Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., Hutter, M.: Learning robust perceptive locomotion for quadrupedal robots in the wild. Science robotics 7(62), eabk2822 (2022)
  32. [32] Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., Tan, M.: Efficient test-time model adaptation without forgetting. In: International conference on machine learning. pp. 16888–16905. PMLR (2022)
  33. [33] Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., Tan, M.: Towards stable test-time adaptation in dynamic wild world. arXiv preprint arXiv:2302.12400 (2023)
  34. [34] Oh, Y., Lee, J., Choi, J., Jung, D., Hwang, U., Yoon, S.: Efficient diffusion-driven corruption editor for test-time adaptation. In: European Conference on Computer Vision. pp. 184–201. Springer (2024)
  35. [35] Pan, B., Panda, R., Jin, S., Feris, R., Oliva, A., Isola, P., Kim, Y.: Langnav: Language as a perceptual representation for navigation. arXiv preprint arXiv:2310.07889 (2023)
  36. [36] Park, S.M., Kim, Y.G.: Visual language navigation: a survey and open challenges. Artificial Intelligence Review p. 365–427 (Jan 2023). https://doi.org/10.1007/s10462-022-10174-9, http://dx.doi.org/10.1007/s10462-022-10174-9
  37. [37] Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., Hengel, A.v.d.: Reverie: Remote embodied visual referring expression in real indoor environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9982–9991 (2020)
  38. [38] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., et al.: Habitat: A platform for embodied ai research. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9339–9347 (2019)
  39. [39] Tan, M., Chen, P., Zhi, H., Mai, J., Rosman, B., Ji, D., Zeng, R.: Source-free elastic model adaptation for vision-and-language navigation. IEEE Transactions on Multimedia (2025)
  40. [40] Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., Bengio, Y.: Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482 (2023)
  41. [41] Tsai, Y.Y., Chen, F.C., Chen, A.Y., Yang, J., Su, C.C., Sun, M., Kuo, C.H.: Gda: Generalized diffusion for robust test-time adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23242–23251 (2024)
  42. [42] Wang, D., Shelhamer, E., Liu, S., Olshausen, B., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020)
  43. [43] Wang, Q., Fink, O., Van Gool, L., Dai, D.: Continual test-time domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7201–7211 (2022)
  44. [44] Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025)
  45. [45] Yue, Z., Yong, H., Zhao, Q., Meng, D., Zhang, L.: Variational denoising network: Toward blind noise modeling and removal. Advances in neural information processing systems 32 (2019)
  46. [46] Zeng, S., Qi, D., Chang, X., Xiong, F., Xie, S., Wu, X., Liang, S., Xu, M., Wei, X.: Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548 (2025)
  47. [47] Zhang, J., Wang, K., Wang, S., Li, M., Liu, H., Wei, S., Wang, Z., Zhang, Z., Wang, H.: Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224 (2024)
  48. [48] Zhang, J., Wang, K., Xu, R., Zhou, G., Hong, Y., Fang, X., Wu, Q., Zhang, Z., Wang, H.: Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852 (2024)
  49. [49] Zhang, K., Li, Y., Liang, J., Cao, J., Zhang, Y., Tang, H., Fan, D.P., Timofte, R., Gool, L.V.: Practical blind image denoising via swin-conv-unet and data synthesis. Machine Intelligence Research 20(6), 822–836 (2023)
  50. [50] Zhang, Y., Ma, Z., Li, J., Qiao, Y., Wang, Z., Chai, J., Wu, Q., Bansal, M., Kordjamshidi, P.: Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. arXiv preprint arXiv:2407.07035 (2024)
  51. [51] Zhang, Y., Xu, Y., Wei, H., Lin, Z., Zou, X., Chen, C., Zhuang, H.: Analytic continual test-time adaptation for multi-modality corruption. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 1929–1937 (2025)
  52. [52] Zhou, G., Hong, Y., Wang, Z., Wang, X.E., Wu, Q.: Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In: European Conference on Computer Vision. pp. 260–278. Springer (2024)
  53. [53] Zhou, G., Hong, Y., Wu, Q.: Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 7641–7649 (2024)
  54. [54] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)

A Pseudo-code for FlowDec

The pseudocode for the FlowDec is shown in two phases. – Training (Algor...