pith. machine review for the scientific record.

arxiv: 2604.26067 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic SLAM · open-vocabulary · dynamic environments · monocular video · multi-modal fusion · factor graph · robust optimization · foundation models

The pith

An online system performs open-vocabulary semantic SLAM in dynamic environments from raw monocular RGB video by tightly coupling language and vision embeddings with geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a system that associates natural language queries with 3D objects and regions in changing scenes, using only standard video input. It does this by embedding features from foundation models into every stage of the SLAM pipeline, including how the map is built and optimized. Adaptive robust kernels downweight or ignore contributions from moving parts of the scene. If the approach holds, it would allow semantic understanding in robotics and video processing without the usual requirements for special sensors or offline computation. This matters because real environments are rarely static and hardware is often limited to cameras alone.
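Grounding of this kind typically reduces to a nearest-neighbor search between a text embedding and per-point map embeddings in a shared vision-language space (here, SigLIP's). A minimal sketch of that query step, with all names and data hypothetical rather than taken from the paper:

```python
import numpy as np

def query_map(text_emb, point_embs, top_k=3):
    """Rank map points by cosine similarity to a language-query embedding.

    text_emb:   (D,)   embedding of the query text (e.g. from a SigLIP-style text encoder)
    point_embs: (N, D) per-point embeddings stored in the semantic map
    Returns the indices and similarities of the top_k matching points.
    """
    t = text_emb / np.linalg.norm(text_emb)
    p = point_embs / np.linalg.norm(point_embs, axis=1, keepdims=True)
    sims = p @ t                          # cosine similarity per map point
    order = np.argsort(sims)[::-1][:top_k]
    return order, sims[order]

# toy 3-D embedding space: the query points almost exactly at map point 0
points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
idx, scores = query_map(np.array([0.9, 0.1, 0.0]), points)
# idx[0] == 0: the x-axis point is the best match
```

The real system attaches such embeddings to geometric map structure; this sketch only shows the retrieval arithmetic.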

Core claim

RADIO-ViPE is presented as an online semantic SLAM system for open-vocabulary grounding. It tightly couples multi-modal embeddings from models such as RADIO with geometric scene information in initialization, optimization, and factor-graph connections; employs adaptive robust kernels to handle dynamic objects and rearrangements; operates directly on raw monocular RGB streams without requiring camera intrinsics, depth, or pose initialization; and achieves state-of-the-art results on the dynamic TUM-RGBD benchmark.

What carries the argument

Tight multi-modal coupling of vision-language embeddings with geometric factors inside a factor graph using adaptive robust kernels for dynamic handling.
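The adaptive robust kernels in question follow Barron's general loss [8], which interpolates between common robust estimators through a shape parameter α; lowering α suppresses the influence of large residuals, such as those produced by moving objects. A sketch of the loss function itself, not of the authors' integration of it into their factor graph:

```python
import math

def barron_rho(x, alpha, c=1.0):
    """Barron's general robust loss rho(x; alpha, c) [8] (sketch, not the
    paper's implementation). alpha = 2 gives a quadratic (L2) kernel,
    alpha = 0 the Cauchy/Lorentzian, and alpha -> -inf the Welsch kernel.
    Smaller alpha downweights large residuals more aggressively."""
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return math.log(0.5 * z + 1.0)
    if alpha == float("-inf"):
        return 1.0 - math.exp(-0.5 * z)
    d = abs(alpha - 2.0)
    return (d / alpha) * ((z / d + 1.0) ** (alpha / 2.0) - 1.0)

# a 3-sigma residual is penalized far less under a robust setting
# (alpha = -2) than under plain least squares (alpha = 2)
```

Making α itself a variable of the optimization is what makes the kernel "adaptive": the solver can loosen or tighten outlier rejection per factor as the scene changes.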

Load-bearing premise

Embeddings from agglomerative foundation models can be consistently and tightly integrated with geometric information across the entire SLAM pipeline without needing prior camera intrinsics, depth, or pose.

What would settle it

A test on the dynamic TUM-RGBD benchmark would settle it: if the system failed to match or outperform existing methods in semantic grounding accuracy, or failed to handle moving objects as claimed, the effectiveness of the tight coupling would be disproven.

Figures

Figures reproduced from arXiv: 2604.26067 by Jaafar Mahmoud, Maxim Popov, Mikhail Iumanov, Sergey Kolyubin, Tianhao Li, Zaid Nasser.

Figure 1: RADIO-ViPE, an online, ready-to-deploy semantic SLAM system.
Figure 2: RADIO-ViPE pipeline, grounded in RADIO [1], generating language-aligned features within the SigLIP [31] embedding space. Furthermore, we operate RADSeg using a sliding-window approach, performing inference over overlapping image regions and subsequently refining the aggregated feature map through a self-attention mechanism. It allowed us to provide an optimal balance between spatial discriminability a…
Figure 3: Adaptive robust kernels based on Barron’s general loss [8].
Figure 4: Ablation study on RADIO PCA feature dimensionality for semantic mapping on the Replica dataset. D=256 (—•—) closely matches the full-dimensional baseline (ΔmIoU < 1%).
Figure 5: Quantitative results of RADIO-ViPE on Replica with different text…
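Figure 4's ablation concerns compressing high-dimensional RADIO features with PCA before mapping; D = 256 reportedly stays within about 1% mIoU of the full-dimensional baseline. A generic sketch of such a compression step (the paper's exact fitting procedure, e.g. which data the basis is fit on, is not specified here):

```python
import numpy as np

def pca_compress(features, d=256):
    """Project per-pixel feature vectors onto their top-d principal
    components. Generic PCA sketch, not the authors' implementation.

    features: (N, D) feature matrix with D >> d
    returns:  (N, d) compressed features and the (D, d) basis
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # rows of vt are principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:d].T
    return centered @ basis, basis
```

Storing 256-dimensional rather than full-width embeddings per map point is what makes dense semantic mapping tractable online; the figure's claim is that little grounding accuracy is lost in the projection.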
read the original abstract

We present RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings -- spanning vision and language -- derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling takes place in initialization, optimization and factor graph connections to improve the consistency of the map from multiple modalities. The optimization is wrapped within adaptive robust kernels, designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during ego-centric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents RADIO-ViPE, an online tightly-coupled multi-modal semantic SLAM system that performs open-vocabulary grounding by fusing embeddings from agglomerative foundation models (e.g., RADIO) with geometric information. It operates directly on raw monocular RGB streams without camera intrinsics, depth sensors, or pose initialization, using adaptive robust kernels in a factor-graph optimization to handle dynamic objects and scene changes, and reports SOTA results on the dynamic TUM-RGBD benchmark while remaining competitive with offline calibrated methods.

Significance. If the performance claims and scale-recovery mechanism hold under the stated assumptions, the work would be significant for enabling real-world deployment of open-vocabulary semantic SLAM in unconstrained dynamic environments using only monocular RGB, removing reliance on calibrated sensors or static-scene priors that limit prior systems.

major comments (2)
  1. The central claim that multi-modal embeddings can be tightly coupled into initialization, optimization, and factor-graph edges to recover consistent metric-scale 3D maps from raw monocular RGB in dynamic scenes is load-bearing; the skeptic note correctly identifies unresolved scale ambiguity, and without explicit demonstration (e.g., via scale-consistency metrics or comparison against known intrinsics) the SOTA result on TUM-RGBD may not generalize beyond the benchmark's motion statistics.
  2. Experiments section: the abstract asserts SOTA performance on dynamic TUM-RGBD and competitive results versus offline methods, yet supplies no quantitative metrics, error bars, ablation tables, or implementation details on how embedding consistency constrains bundle-adjustment scale or rejects moving elements; this absence prevents verification of the tight-coupling contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on RADIO-ViPE. We address each major comment below with clarifications on our scale-recovery approach and experimental reporting, and we commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: The central claim that multi-modal embeddings can be tightly coupled into initialization, optimization, and factor-graph edges to recover consistent metric-scale 3D maps from raw monocular RGB in dynamic scenes is load-bearing; the skeptic note correctly identifies unresolved scale ambiguity, and without explicit demonstration (e.g., via scale-consistency metrics or comparison against known intrinsics) the SOTA result on TUM-RGBD may not generalize beyond the benchmark's motion statistics.

    Authors: We acknowledge that the manuscript does not include dedicated scale-consistency metrics or direct comparisons against known intrinsics. The system recovers consistent scale through the tight integration of vision-language embeddings with geometric factors and adaptive robust kernels, which enforce multi-view and multi-modal consistency even in dynamic scenes; this is what enables competitive results against calibrated offline methods on TUM-RGBD without supplying intrinsics at runtime. To address the concern directly, the revised manuscript will add an explicit scale-consistency analysis subsection, including quantitative metrics and comparisons using the benchmark's ground-truth intrinsics. revision: yes

  2. Referee: Experiments section: the abstract asserts SOTA performance on dynamic TUM-RGBD and competitive results versus offline methods, yet supplies no quantitative metrics, error bars, ablation tables, or implementation details on how embedding consistency constrains bundle-adjustment scale or rejects moving elements; this absence prevents verification of the tight-coupling contribution.

    Authors: We agree that the current experiments section would benefit from expanded quantitative support. While the manuscript reports SOTA results on dynamic TUM-RGBD and competitive performance versus offline baselines, we will revise the experiments to include error bars, full ablation tables isolating the contribution of embedding consistency to scale constraint and dynamic rejection, and additional implementation details on the factor-graph edges and robust kernels. These additions will make the tight-coupling benefits verifiable. revision: yes
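The scale-consistency analysis the authors commit to could, for instance, report the global scale factor between the estimated monocular trajectory and ground truth. A hypothetical sketch of such a metric (not taken from the paper), using the ratio of RMS spreads about the centroids, which equals the similarity-alignment scale when the two trajectories differ only by a rigid transform plus scale:

```python
import numpy as np

def trajectory_scale(est, gt):
    """Global scale factor between an estimated (scale-ambiguous,
    monocular) trajectory and ground truth, via the ratio of RMS spreads
    about the centroids. Hypothetical metric, illustrative only.

    est, gt: (N, 3) arrays of corresponding camera positions
    """
    e = est - est.mean(axis=0)
    g = gt - gt.mean(axis=0)
    return float(np.sqrt((g ** 2).sum() / (e ** 2).sum()))

# a trajectory reconstructed at half scale and shifted in space
# yields a scale factor of exactly 2
gt = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
               [2.0, 2.0, 0.0], [0.0, 2.0, 4.0]])
est = 0.5 * gt + np.array([3.0, -1.0, 2.0])
```

Reporting this factor per sequence, and its drift over time, would directly address whether the multi-modal coupling stabilizes metric scale without intrinsics.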

Circularity Check

0 steps flagged

No circularity detected; claims rest on external benchmarks

full rationale

The paper describes a monocular semantic SLAM architecture that couples RADIO embeddings with geometric factors in initialization, optimization, and factor graphs, but presents no equations, derivations, or parameter-fitting steps. All performance assertions are grounded in experiments on the independent dynamic TUM-RGBD benchmark rather than any self-referential reduction or self-citation chain that would force the result by construction. The system description is therefore self-contained against external validation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is limited to the core assumptions stated there.

axioms (1)
  • domain assumption: Embeddings from agglomerative foundation models such as RADIO are consistent enough to be tightly coupled with geometric scene information across initialization, optimization, and factor-graph stages.
    The abstract states that the system derives multi-modal embeddings from these models and couples them with geometry.

pith-pipeline@v0.9.0 · 5562 in / 1308 out tokens · 58749 ms · 2026-05-07T16:34:58.570996+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    Am-radio: Agglomerative vision foundation model reduce all domains into one,

    M. Ranzinger, G. Heinrich, et al., “Am-radio: Agglomerative vision foundation model reduce all domains into one,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2024, pp. 12490–12500.

  2. [2]

    A benchmark for the evaluation of rgb-d slam systems,

    J. Sturm, N. Engelhard, et al., “A benchmark for the evaluation of rgb-d slam systems,” in The International Conference on Intelligent Robot Systems, October 2012.

  3. [3]

    Ego4d: Around the World in 3,000 Hours of Egocentric Video,

    K. Grauman, A. Westbury, E. Byrne, et al., “Ego4d: Around the World in 3,000 Hours of Egocentric Video,” in IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2022.

  4. [4]

    Aria digital twin: A new benchmark dataset for egocentric 3d machine perception,

    X. Pan, N. Charron, et al., “Aria digital twin: A new benchmark dataset for egocentric 3d machine perception,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 20133–20143.

  5. [5]

    Vipe: Video pose engine for 3d geometric perception,

    J. Huang et al., “Vipe: Video pose engine for 3d geometric perception,” NVIDIA Research Whitepapers, arXiv:2508.10934, 2025.

  6. [6]

    RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

    O. Alama, D. Jariwala, et al., “Radseg: Unleashing parameter and compute efficient zero-shot open-vocabulary segmentation using agglomerative models,” arXiv preprint arXiv:2511.19704, 2025.

  7. [7]

    Rvwo: A robust visual-wheel slam system for mobile robots in dynamic environments,

    J. Mahmoud, A. Penkovskiy, H. T. Long Vuong, A. Burkov, and S. Kolyubin, “Rvwo: A robust visual-wheel slam system for mobile robots in dynamic environments,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023, pp. 3468–3474.

  8. [8]

    A general and adaptive robust loss function,

    J. T. Barron, “A general and adaptive robust loss function,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4326–4334.

  9. [9]

    Mast3r-slam: Real-time dense slam with 3d reconstruction priors,

    R. Murai, E. Dexheimer, and A. J. Davison, “Mast3r-slam: Real-time dense slam with 3d reconstruction priors,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16695–16705.

  10. [10]

    Vggt-slam: Dense rgb slam optimized on the sl(4) manifold,

    D. Maggio, H. Lim, and L. Carlone, “Vggt-slam: Dense rgb slam optimized on the sl(4) manifold,” arXiv preprint arXiv:2505.12549, 2025.

  11. [11]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in CVPR, 2024.

  12. [12]

    Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,

    C. Campos et al., “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, Dec. 2021.

  13. [13]

    Kimera: an open-source library for real-time metric-semantic localization and mapping,

    A. Rosinol, M. Abate, et al., “Kimera: an open-source library for real-time metric-semantic localization and mapping,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1689–1696.

  14. [14]

    Rgbds-slam: A rgb-d semantic dense slam based on 3d multi level pyramid gaussian splatting,

    Z. C. et al., “Rgbds-slam: A rgb-d semantic dense slam based on 3d multi level pyramid gaussian splatting,” 2024.

  15. [15]

    Samslam: A visual slam based on segment anything model for dynamic environment,

    X. Chen, T. Wang, H. Mai, and L. Yang, “Samslam: A visual slam based on segment anything model for dynamic environment,” in 2024 8th International Conference on Robotics, Control and Automation (ICRCA), 2024, pp. 91–97.

  16. [16]

    Beyond bare queries: Open-vocabulary object grounding with 3d scene graph,

    S. Linok, T. Zemskova, S. Ladanova, R. Titkov, D. Yudin, M. Monastyrny, and A. Valenkov, “Beyond bare queries: Open-vocabulary object grounding with 3d scene graph,” 2024.

  17. [17]

    Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,

    Q. Gu, A. Kuwajerwala, et al., “Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning,” 2023.

  18. [18]

    Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation,

    A. Werby, C. Huang, et al., “Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation,” in Robotics: Science and Systems XX, ser. RSS2024, Robotics: Science and Systems Foundation, July 2024.

  19. [19]

    Openscene: 3d scene understanding with open vocabularies,

    S. Peng, K. Genova, C. M. Jiang, A. Tagliasacchi, M. Pollefeys, and T. Funkhouser, “Openscene: 3d scene understanding with open vocabularies,” 2023

  20. [20]

    Openmask3d: Open-vocabulary 3d instance segmentation,

    A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, “Openmask3d: Open-vocabulary 3d instance segmentation,” 2023.

  21. [21]

    Clio: Real-time task-driven open-set 3d scene graphs,

    D. Maggio, Y. Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone, “Clio: Real-time task-driven open-set 3d scene graphs,” 2024.

  22. [22]

    Ovo-slam: Open-vocabulary online simultaneous localization and mapping,

    T. B. Martins, M. R. Oswald, and J. Civera, “Ovo-slam: Open-vocabulary online simultaneous localization and mapping,” 2024.

  23. [23]

    Rayfronts: Open-set semantic ray frontiers for online scene understanding and exploration,

    O. Alama, A. Bhattacharya, et al., “Rayfronts: Open-set semantic ray frontiers for online scene understanding and exploration,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025, pp. 5930–5937.

  24. [24]

    The replica dataset: A digital replica of indoor spaces,

    J. Straub, T. Whelan, et al., “The replica dataset: A digital replica of indoor spaces,” 2019.

  25. [25]

    DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras,

    Z. Teed and J. Deng, “DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras,” Advances in Neural Information Processing Systems, 2021.

  26. [26]

    Unidepth: Universal monocular metric depth estimation,

    L. Piccinelli et al., “Unidepth: Universal monocular metric depth estimation,” in CVPR, 2024, pp. 10106–10116.

  27. [27]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,

    R. Wang et al., “Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,” in CVPR, 2025, pp. 5261–5271.

  28. [28]

    Adaptive robust kernels for non-linear least squares problems,

    N. Chebrolu et al., “Adaptive robust kernels for non-linear least squares problems,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2240–2247, 2021.

  29. [29]

    GeoCalib: Single-image Calibration with Geometric Optimization,

    A. V. et al., “GeoCalib: Single-image Calibration with Geometric Optimization,” in ECCV, 2024.

  30. [30]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,

    M. Hu, W. Yin, et al., “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  31. [31]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11975–11986.

  32. [32]

    Dynaslam ii: Tightly-coupled multi-object tracking and slam,

    B. Bescos, C. Campos, J. D. Tardós, and J. Neira, “Dynaslam ii: Tightly-coupled multi-object tracking and slam,” 2020. [Online]. Available: https://arxiv.org/abs/2010.07820

  33. [33]

    Dld-slam: Rgb-d visual simultaneous localisation and mapping in indoor dynamic environments based on deep learning,

    H. Yu, Q. Wang, C. Yan, Y. Feng, Y. Sun, and L. Li, “Dld-slam: Rgb-d visual simultaneous localisation and mapping in indoor dynamic environments based on deep learning,” Remote Sensing, vol. 16, no. 2, 2024.

  34. [34]

    V3d-slam: Robust rgb-d slam in dynamic environments with 3d semantic geometry voting,

    T. Dang, K. Nguyen, and M. Huber, “V3d-slam: Robust rgb-d slam in dynamic environments with 3d semantic geometry voting,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 7847–7853.

  35. [35]

    Dgs-slam: A fast and robust rgbd slam in dynamic environments combined by geometric and semantic information,

    L. Yan, X. Hu, et al., “Dgs-slam: A fast and robust rgbd slam in dynamic environments combined by geometric and semantic information,” Remote Sensing, vol. 14, no. 3, p. 795, 2022.

  36. [36]

    Rodyn-slam: Robust dynamic dense rgb-d slam with neural radiance fields,

    H. Jiang, Y. Xu, K. Li, J. Feng, and L. Zhang, “Rodyn-slam: Robust dynamic dense rgb-d slam with neural radiance fields,” 2024.

  37. [37]

    Dynamon: Motion-aware fast and robust camera localization for dynamic neural radiance fields,

    N. Schischka et al., “Dynamon: Motion-aware fast and robust camera localization for dynamic neural radiance fields,” IEEE Robotics and Automation Letters, pp. 1–8, 2024.

  38. [38]

    Conceptfusion: Open-set multimodal 3d mapping,

    Jatavallabhula et al., “Conceptfusion: Open-set multimodal 3d mapping,” Robotics: Science and Systems (RSS), 2023.

  39. [39]

    Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,

    S. Hajimiri, I. Ben Ayed, and J. Dolz, “Pay attention to your neighbours: Training-free open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025

  40. [40]

    Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation,

    Y. Shi, M. Dong, and C. Xu, “Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation,” arXiv preprint arXiv:2411.09219, 2024.

  41. [41]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

    S. Liu, Z. Zeng, et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision, Springer, 2024, pp. 38–55.

  42. [42]

    Segment anything,

    A. Kirillov, E. Mintun, et al., “Segment anything,” 2023.