SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

Connor Smith; Devdeep Ray; Rahul Singh; Sarita Adve

arxiv: 2606.12849 · v1 · pith:5XKGARMTnew · submitted 2026-06-11 · cs.DC · cs.CV · cs.RO


Pith reviewed 2026-06-27 06:12 UTC · model grok-4.3


keywords: semantic mapping · extended reality · device-cloud architecture · object-level organization · open-vocabulary · real-time querying · low-power XR · semantic XR


The pith

Elevating semantically identifiable objects to first-class units enables real-time open-vocabulary semantic mapping on XR devices with cloud offload.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SemanticXR, a device-cloud architecture for semantic mapping in extended reality. It establishes that organizing around objects as primary units of communication, execution, and memory allows real-time querying and mapping while respecting mobile power, bandwidth, and memory limits. A sympathetic reader would care because this supports grounded interactions such as AI assistants and spatial search without constant high resource demands. The system reports concrete gains against a baseline using identical perception models, including faster server mapping and bounded device costs.

Core claim

SemanticXR is presented as the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. By elevating semantically identifiable objects to first-class units of communication, execution, and memory across the device and server, object-level parallelism and geometry downsampling improve server mapping latency, depth-mapping co-design reduces upstream bandwidth, and an object-level sparse local map with incremental updates and prioritization enables network-robust querying with bounded memory and downstream bandwidth. Configurable resource-quality trade-offs allow adaptation to application needs and conditions. Agai

What carries the argument

Object-level organization that treats semantically identifiable objects as first-class units of communication, execution, and memory across the device-cloud boundary.
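
To make the idea of an object as a unit of communication, execution, and memory concrete, here is a minimal sketch. It is not the authors' implementation: the `MapObject` record, its field names, the voxel-centroid downsampling scheme, and the 5 cm default voxel size are all assumptions chosen for illustration. The source says only that the system uses object-level parallelism and geometry downsampling.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class MapObject:
    """Hypothetical per-object record; all field names are illustrative."""
    object_id: int
    embedding: list            # open-vocabulary feature vector used for queries
    points: list = field(default_factory=list)  # (x, y, z) geometry samples
    version: int = 0           # bumped on change so only deltas need syncing


def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel with their centroid."""
    buckets = {}
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets.setdefault(key, []).append((x, y, z))
    return [tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for pts in buckets.values()]


def downsample_object(obj, voxel_size=0.05):
    """Return a new record with thinned geometry and a bumped version."""
    return MapObject(obj.object_id, obj.embedding,
                     voxel_downsample(obj.points, voxel_size), obj.version + 1)


def downsample_all(objects, workers=4):
    """Process objects independently; each object is its own unit of work."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(downsample_object, objects))
```

The thread pool here only illustrates the structure: because objects do not share state, they can be dispatched independently. In CPython, pure-Python CPU work like this would not actually speed up under threads, so a real server would presumably use processes, native code, or a GPU.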

If this is right

Server-side mapping latency improves by 2.2X at equal semantic quality through object-level parallelism and geometry downsampling.
Upstream bandwidth remains under 2.5 Mbps via object-level depth-mapping co-design.
Device query latency stays below 100 ms for up to 10,000 objects even during network drops due to the sparse local map and update prioritization.
The device supports tens of thousands of objects within 500 MB while scaling downstream bandwidth with map changes rather than total scene size.
The system adds only 2% device power during normal operation through the object-level sparse local map and configurable trade-offs.
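
A quick back-of-envelope calculation helps put the memory and bandwidth figures above in per-object and per-second terms. The object count of 20,000 is an assumption standing in for "tens of thousands", and decimal megabytes and megabits are assumed because the source does not specify units more precisely.

```python
MB = 1_000_000  # decimal megabytes; the source does not say MB vs MiB


def per_object_budget_bytes(total_bytes, n_objects):
    """Average device memory available per stored object."""
    return total_bytes / n_objects


def mbps_to_bytes_per_second(mbps):
    """Convert a link rate in megabits per second to bytes per second."""
    return mbps * 1_000_000 / 8


budget = per_object_budget_bytes(500 * MB, 20_000)  # 20,000 is an assumed count
upstream = mbps_to_bytes_per_second(2.5)
print(budget, upstream)  # 25000.0 312500.0
```

Under these assumptions, each object has an average budget of 25 KB on the device, and the upstream cap corresponds to about 312 KB per second.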

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The configurable resource-quality trade-offs could let applications tune mapping dynamically to battery state or network quality in untested operating regimes.
Similar object-centric partitioning may reduce communication costs in other cloud-offloaded mobile perception tasks such as real-time scene understanding.
Over longer sessions the bounded-memory design could allow persistent maps across multiple XR sessions without proportional growth in device storage.

Load-bearing premise

Treating semantically identifiable objects as first-class units will produce the stated latency, bandwidth, and memory gains without hidden accuracy costs or unaccounted overheads in the perception pipeline.

What would settle it

Running SemanticXR and the device-cloud baseline on the same XR device and server with a fixed scene containing several thousand objects, then measuring server mapping latency, upstream bandwidth, and device query latency under controlled network drops, would falsify the claims if the object-level version shows no improvement or exceeds the reported thresholds.
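
The falsification test described above could be expressed as a small check over measured values. This is a sketch only: the function name, argument names, and message strings are hypothetical, and the two numeric thresholds are the ones reported in the abstract (upstream bandwidth under 2.5 Mbps, query latency under 100 ms).

```python
# Thresholds quoted from the abstract; everything else here is illustrative.
MAX_UPSTREAM_MBPS = 2.5
MAX_QUERY_LATENCY_MS = 100.0


def falsified_claims(baseline_map_ms, object_level_map_ms,
                     upstream_mbps, query_latency_ms):
    """Return the list of claims that the given measurements contradict."""
    failures = []
    if object_level_map_ms >= baseline_map_ms:
        failures.append("no server-side mapping latency improvement")
    if upstream_mbps >= MAX_UPSTREAM_MBPS:
        failures.append("upstream bandwidth not under 2.5 Mbps")
    if query_latency_ms >= MAX_QUERY_LATENCY_MS:
        failures.append("device query latency not under 100 ms")
    return failures
```

An empty result means the measurements are consistent with the claims; it does not by itself confirm the 2.2X figure, which would need the ratio of the two mapping latencies across repeated runs.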

Figures

Figures reproduced from arXiv: 2606.12849 by Connor Smith, Devdeep Ray, Rahul Singh, Sarita Adve.

**Figure 1.** SemanticXR overview. SemanticXR enables real-time, open-vocabulary semantic mapping for low-power XR through ...

**Figure 2.** Representative semantic mapping pipeline using 2D foun...

**Figure 3.** Server-side semantic mapping latency across 8 scenes ...

**Figure 4.** Average query latency for server-side queries ...

**Figure 5.** Local query latency and memory footprint as a function ...

**Figure 6.** Per-update downstream transfer size as a function of up...

**Figure 8.** SemanticXR deployed end-to-end with an iPad client: (a) ...

Original abstract

Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.
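
The device-side behavior described in the abstract (a sparse local map, incremental updates, update prioritization, and bounded memory) can be illustrated with a small sketch. Everything concrete here is an assumption: the `LocalMap` class, the tuple layout of an update, the use of a numeric priority supplied by the server, lowest-priority eviction, and cosine similarity as the query score are illustrative choices, not details taken from the paper.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LocalMap:
    """Hypothetical device-side object store with a hard object capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = {}  # object_id -> (version, priority, embedding)

    def apply_updates(self, updates):
        """Apply (object_id, version, priority, embedding) deltas.

        Higher-priority deltas are applied first; stale versions are skipped.
        """
        for oid, version, priority, emb in sorted(updates, key=lambda u: -u[2]):
            current = self.objects.get(oid)
            if current is not None and current[0] >= version:
                continue  # already have this version or a newer one
            self.objects[oid] = (version, priority, emb)
        # Enforce the memory bound by evicting the lowest-priority objects.
        while len(self.objects) > self.capacity:
            victim = min(self.objects, key=lambda k: self.objects[k][1])
            del self.objects[victim]

    def query(self, query_embedding, k=1):
        """Return the ids of the k stored objects most similar to the query."""
        scored = ((cosine(query_embedding, emb), oid)
                  for oid, (_, _, emb) in self.objects.items())
        return [oid for _, oid in heapq.nlargest(k, scored)]
```

Because `query` touches only local state, it keeps answering during a network drop, and because `apply_updates` consumes per-object deltas, downstream traffic in this sketch grows with the number of changed objects rather than with the size of the whole map.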

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SemanticXR's object-level device-cloud split for semantic mapping delivers concrete latency and bandwidth gains worth referee time, though the 2.2X claim needs full pipeline validation.


The main point is that SemanticXR organizes semantic mapping around objects as first-class units across device and cloud. This produces a 2.2X server-side mapping latency improvement at equal semantic quality, upstream bandwidth under 2.5 Mbps, and sub-100 ms device queries even with network drops.

The work is new in making the device-cloud boundary object-centric rather than frame- or feature-centric. The design uses server parallelism and geometry downsampling, device-side sparse incremental maps with prioritization, and configurable quality tradeoffs. It does well by reporting specific system-level numbers against a matched baseline with the same perception models, directly targeting XR constraints on power, bandwidth, and memory.

The soft spot is whether object association, merging, and device perception overheads are fully costed into the reported gains. The stress-test note is reasonable here: if those steps add latency or alter inputs, the net 2.2X could shrink. The abstract ties results to equal semantic quality after the pipeline, so if the paper shows ablations covering the full flow this concern stays minor. The measurements otherwise look like standard system evaluation.

This paper is for XR systems and mobile-cloud researchers. Readers building real-time semantic apps on constrained devices would take away usable architecture ideas and numbers. It deserves a serious referee because the split is novel, the claims are quantitative, and the evaluation targets the right metrics.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SemanticXR, the first device-cloud system for real-time open-vocabulary semantic mapping and querying under XR constraints. The central contribution is elevating semantically identifiable objects to first-class units of communication, execution, and memory. This enables server-side object-level parallelism and geometry downsampling for mapping, depth-mapping co-design for bandwidth reduction, and device-side sparse local maps with incremental updates and prioritization for query robustness. Against a device-cloud baseline using identical perception models, the system reports 2.2X server-side mapping latency improvement at equal semantic quality, upstream bandwidth under 2.5 Mbps, sub-100 ms queries for 10k objects under network drops, support for tens of thousands of objects in 500 MB, and only 2% added device power.

Significance. If the quantitative results survive full-pipeline accounting, the work demonstrates a practical object-level split that could enable grounded XR interactions on mobile devices. The co-design of depth mapping, object association, and configurable quality trade-offs addresses real deployment constraints in bandwidth, power, and memory that prior cloud-offload approaches have not jointly solved.

major comments (2)

[Abstract and Evaluation section] The 2.2X server-side mapping latency claim at equal semantic quality does not specify whether the baseline and proposed system both include the full costs of on-device object detection/segmentation, cross-frame association, merging logic, and depth co-design. If these steps are required for consistency in the object-level pipeline but are not charged to the baseline, the net gain may be overstated.
[§5 (or equivalent results section)] Latency and quality tables: The equal-semantic-quality condition is stated, but the manuscript must report how quality is measured after all stages (including any accuracy impact from geometry downsampling or incremental updates) and provide error bars or multiple runs to support the 2.2X figure as load-bearing evidence.

minor comments (2)

[Abstract] The abstract states specific numbers (2.2X, <2.5 Mbps, sub-100 ms, 2% power) without referencing the corresponding tables or figures; add explicit cross-references.
[System Architecture section] Notation for object-level units (e.g., how objects are represented for communication vs. memory) should be introduced earlier and used consistently in the architecture diagrams.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on evaluation clarity. We address each major comment below and will revise the manuscript to strengthen the presentation of results.


Referee: [Abstract and Evaluation section] The 2.2X server-side mapping latency claim at equal semantic quality does not specify whether the baseline and proposed system both include the full costs of on-device object detection/segmentation, cross-frame association, merging logic, and depth co-design. If these steps are required for consistency in the object-level pipeline but are not charged to the baseline, the net gain may be overstated.

Authors: The 2.2X improvement refers specifically to server-side mapping latency after data arrival from the device. Both systems employ identical perception models, so on-device object detection/segmentation, cross-frame association, and depth co-design incur the same costs and do not affect the server-side comparison. The baseline is a device-cloud system without object-level organization; the reported gain arises from server-side object-level parallelism and geometry downsampling. Merging logic is part of the server pipeline in both cases. To eliminate ambiguity we will expand the evaluation section with an explicit baseline description confirming consistent accounting of all stages.

revision: yes

Referee: [§5 (or equivalent results section)] Latency and quality tables: The equal-semantic-quality condition is stated, but the manuscript must report how quality is measured after all stages (including any accuracy impact from geometry downsampling or incremental updates) and provide error bars or multiple runs to support the 2.2X figure as load-bearing evidence.

Authors: We agree that the quality metric and its post-processing impact require explicit reporting. Semantic quality is quantified via mean Intersection-over-Union (mIoU) computed on final object labels after geometry downsampling and incremental map updates. We will add this definition to the results section, include the measured mIoU values confirming equal quality between systems, and report error bars derived from multiple independent runs to substantiate the 2.2X latency figure.

revision: yes
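
For readers unfamiliar with the metric named in the response above, the following is a minimal sketch of mean Intersection-over-Union over per-element semantic labels. It shows the standard definition only; how the paper assigns labels to points or objects, and which classes it averages over, is not stated in the source, so the flat label lists and the function name are assumptions.

```python
def mean_iou(predicted, truth):
    """Mean IoU over every class that appears in either label list."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("label lists must be non-empty and equal in length")
    ious = []
    for cls in set(predicted) | set(truth):
        # Intersection: elements labeled cls in both; union: in at least one.
        inter = sum(1 for p, t in zip(predicted, truth) if p == cls and t == cls)
        union = sum(1 for p, t in zip(predicted, truth) if p == cls or t == cls)
        ious.append(inter / union)
    return sum(ious) / len(ious)


pred = ["chair", "chair", "table", "lamp"]
gt = ["chair", "table", "table", "lamp"]
score = mean_iou(pred, gt)  # chair 1/2, table 1/2, lamp 1/1
```

In this toy case the per-class IoUs are 0.5, 0.5, and 1.0, so the mean is 2/3. Comparing such a score between the baseline and the object-level system, measured after downsampling, is one way the "equal semantic quality" condition could be checked.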

Circularity Check

0 steps flagged

No circularity: claims rest on measured system performance, not derivations or fitted predictions


The paper is a systems contribution whose central claims (2.2X server-side latency improvement at equal semantic quality, bandwidth and memory bounds) are presented as empirical results from an implemented device-cloud prototype evaluated against a baseline using identical perception models. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description. The object-level organization is an architectural choice whose benefits are quantified by direct measurement rather than reduced to inputs by construction. This is the expected outcome for a measurement-driven systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the core design choice of object-level organization. The key insight is treated as a domain assumption rather than a derived result.

axioms (1)

domain assumption: Elevating semantically identifiable objects to first-class units of communication, execution, and memory improves latency, bandwidth, and memory usage under XR constraints.
This is stated as the key insight that enables the system benefits.




[1] Apple Inc. Apple iPad Pro, 2020. Tablet device with LiDAR sensor.

[2] Apple Inc. Apple ARKit, 2023. Apple’s AR Developer tool.

[3] A. Behroozi, Y. Chen, V. Fruchter, L. Subramanian, S. Srikanth, and S. Mahlke. Slimslam: An adaptive runtime for visual-inertial simultaneous localization and mapping. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS ’24, p. 900–915. Association for Comp... doi: 10.1145/3620666.3651361

[4] F. L. Busch, T. Homberger, J. Ortega-Peimbert, Q. Yang, and O. Andersson. One map to find them all: Real-time open-vocabulary mapping for zero-shot multi-object navigation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 14835–14842, 2025. doi: 10.1109/ICRA55743.2025.11128393

[6] G. Chen, F. Hácha, L. Váša, and M. Dasari. Tvmc: Time-varying mesh compression using volume-tracked reference meshes. In Proceedings of the 16th ACM Multimedia Systems Conference, MMSys ’25, p. 79–89. Association for Computing Machinery, New York, NY, USA, 2025. doi: 10.1145/3712676.3714440

[7] K. Chen, T. Li, H.-S. Kim, D. E. Culler, and R. H. Katz. Marvel: Enabling mobile augmented reality with low energy and low latency. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, SenSys ’18, p. 292–304. Association for Computing Machinery, New York, NY, USA, 2018. doi: 10.1145/3274783.3274834

[8] M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829, 2023.

[9] A. Dhakal, X. Ran, Y. Wang, J. Chen, and K. K. Ramakrishnan. Slamshare: visual simultaneous localization and mapping for real-time multi-user augmented reality. In Proceedings of the 18th International Conference on Emerging Networking EXperiments and Technologies, CoNEXT ’22, p. 293–306. Association for Computing Machinery, New York, NY, USA, 2022. doi: 10.1145/3555050.3569142

[10] S. Gao, J. Liu, Q. Jiang, F. Sinclair, W. Sentosa, B. Godfrey, and S. Adve. Xrgo: Design and evaluation of rendering offload for low-power extended reality devices. In Proceedings of the 16th ACM Multimedia Systems Conference, MMSys ’25, p. 124–135. Association for Computing Machinery, New York, NY, USA, 2025. doi: 10.1145/3712676.3714444

[11] Google. Immersive stream for xr overview. Streaming for extended reality.

[12] Google. Project astra, 2024. Google’s universal AR AI agent.

[13] Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, C. Gan, C. M. de Melo, J. B. Tenenbaum, A. Torralba, F. Shkurti, and L. Paull. Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 5021–5...

[14] N. Hughes, Y. Chang, and L. Carlone. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization

[15] N. Hughes, Y. Chang, S. Hu, R. Talak, R. Abdulhai, J. Strader, and L. Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems. The International Journal of Robotics Research, 2024. doi: 10.1177/02783649241229725

[16] M. Huzaifa, R. Desai, S. Grayson, X. Jiang, Y. Jing, J. Lee, F. Lu, Y. Pang, J. Ravichandran, F. Sinclair, B. Tian, H. Yuan, J. Zhang, and S. V. Adve. Illixr: An open testbed to enable extended reality systems research. IEEE Micro, 42(4):97–106, 2022. doi: 10.1109/MM.2022.3161018

[17] G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. Openclip, July 2021. If you use this software, please cite it as below. doi: 10.5281/zenodo.5143773

[18] K. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, T. Chen, S. Li, G. Iyer, S. Saryazdi, N. Keetha, A. Tewari, J. Tenenbaum, C. de Melo, M. Krishna, L. Paull, F. Shkurti, and A. Torralba. Conceptfusion: Open-set multimodal 3d mapping. Robotics: Science and Systems (RSS), 2023.

[19] Q. Jiang, Y. Pang, W. Sentosa, S. Gao, M. Huzaifa, J. Zhang, J. Perez-Ramirez, D. Das, D. Gonzalez-Aguirre, B. Godfrey, and S. Adve. Remotevio: Offloading head tracking in an end-to-end xr system. In Proceedings of the 16th ACM Multimedia Systems Conference, MMSys ’25, p. 101–112. Association for Computing Machinery, New York, NY, USA, 2025. doi: 10.1145/3712676.3714442

[20] T. Jin, M. Dasa, C. Smith, K. Apicharttrisorn, S. Seshan, and A. Rowe. Meshreduce: Scalable and bandwidth efficient 3d scene capture. In 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR), pp. 20–30, 2024. doi: 10.1109/VR58804.2024.00026

[21] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023.

[22] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik. Lerf: Language embedded radiance fields. In International Conference on Computer Vision (ICCV), 2023.

[23] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

[24] Z. J. Kong, Q. Xu, and Y. C. Hu. Arise: High-capacity ar offloading inference serving via proactive scheduling. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services, MOBISYS ’24, p. 451–464. Association for Computing Machinery, New York, NY, USA, 2024. doi: 10.1145/3643832.3661894

[25] Z. J. Kong, Q. Xu, J. Meng, and Y. C. Hu. Accumo: Accuracy-centric multitask offloading in edge-assisted mobile augmented reality. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, ACM MobiCom ’23. Association for Computing Machinery, New York, NY, USA, 2023. doi: 10.1145/3570361.3592531

[26] Z. Lai, Y. C. Hu, Y. Cui, L. Sun, and N. Dai. Furion: Engineering high-quality immersive virtual reality on today’s mobile devices. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom ’17, p. 409–421. Association for Computing Machinery, New York, NY, USA, 2017. doi: 10.1145/3117811.3117815

[27] M. Li, S. Liu, H. Zhou, G. Zhu, N. Cheng, T. Deng, and H. Wang. Sgs-slam: Semantic gaussian splatting for neural dense slam. p. 163–179. Springer-Verlag, Berlin, Heidelberg, 2024. doi: 10.1007/978-3-031-72751-1_10

[28] L. Liu, H. Li, and M. Gruteser. Edge assisted real-time object detection for mobile augmented reality. In The 25th Annual International Conference on Mobile Computing and Networking, MobiCom ’19. Association for Computing Machinery, New York, NY, USA, 2019. doi: 10.1145/3300061.3300116

[29] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.

[30] E. Lu, S. Bharadwaj, M. Dasari, C. Smith, S. Seshan, and A. Rowe. Renderfusion: Balancing local and remote rendering for interactive 3d scenes. In 2023 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 312–321, 2023. doi: 10.1109/ISMAR59233.2023.00046

[31] S. Lu, H. Chang, E. P. Jing, A. Boularias, and K. Bekris. Ovir-3d: Open-vocabulary 3d instance retrieval without training on 3d data. In 7th Annual Conference on Robot Learning, 2023.

[32] D. Maggio, Y. Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone. Clio: Real-time task-driven open-set 3d scene graphs. IEEE Robotics and Automation Letters, 9(10):8921–8928, 2024. doi: 10.1109/LRA.2024.3451395

[33] Y. Mao, J. Zhong, C. Fang, J. Zheng, R. Tang, H. Zhu, P. Tan, and Z. Zhou. Spatiallm: Training large language models for structured indoor modeling, 2025.

[34] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1):99–106, Dec. 2021. doi: 10.1145/3503250

[35] NVIDIA. Cloudxr, 2023. Streaming for extended reality.

[36] NVIDIA. tegrastats Utility. https://docs.nvidia.com/drive/drive_os_5.1.6.1L/nvvib_docs/index.html#page/DRIVE_OS_Linux_SDK_Development_Guide/Utilities/util_tegrastats.html, 2024.

[37] Y. Pang, S. Kondguli, S. Wang, and S. Adve. Ada: A distributed, power-aware, real-time scene provider for xr. IEEE Transactions on Visualization and Computer Graphics, 31(11):9677–9687, 2025. doi: 10.1109/TVCG.2025.3616835

[38] Z. Peng, T. Shao, L. Yong, J. Zhou, Y. Yang, J. Wang, and K. Zhou. Rtg-slam: Real-time 3d reconstruction at scale using gaussian splatting. 2024.

[39] Z. Peng, K. Zhou, and T. Shao. Gaussian-plus-sdf slam: High-fidelity 3d reconstruction at 150+ fps. Computational Visual Media, 2025.

[40] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. Langsplat: 3d language gaussian splatting. arXiv preprint arXiv:2312.16084, 2023.

[41] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021.

[42] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024.

[43] C. Schuhmann, R. Beaumont, R. Vencu, C. W. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. R. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processin... 2022.

[44] S. Shi, J. Cui, Z. Jiang, Z. Yan, G. Xing, J. Niu, and Z. Ouyang. Vips: real-time perception fusion for infrastructure-assisted autonomous driving. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, MobiCom ’22, p. 133–146. Association for Computing Machinery, New York, NY, USA, 2022. doi: 10.1145/3495243.3560539

[45] S. Srinidhi, E. Lu, and A. Rowe. Xair: An xr platform that integrates large language models with the physical world. In 2024 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 759–767, 2024. doi: 10.1109/ISMAR62088.2024.00091

[46] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe. The Replica dat...

[47] A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann. OpenMask3D: Open-Vocabulary 3D Instance Segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[48] Y. Tang, J. Zhang, Y. Lan, Y. Guo, D. Dong, C. Zhu, and K. Xu. Onlineanyseg: Online zero-shot 3d segmentation by visual foundation model guided 2d mask merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3676–3685, June 2025.

[49] P. K. A. Vasu*, H. P. Ansari*, F. Faghri*, R. Vemulapalli, and O. Tuzel. Mobileclip: Fast image-text models through multi-modal reinforced training. In CVPR, 2024.

[50] S.-C. Wu, J. Wald, K. Tateno, N. Navab, and F. Tombari. SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[51] C. Xu, R. Kumaran, N. Stier, K. Yu, and T. Höllerer. Multi-modal 3d fusion and in-situ learning for spatially aware ai. In 2024 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2024. doi: 10.1109/ISMAR62088.2024.00063

[52] X. Xu, H. Chen, L. Zhao, Z. Wang, J. Zhou, and J. Lu. Embodiedsam: Online segment any 3d thing in real time. arXiv preprint arXiv:2408.11811, 2024.

[53] K. Yamazaki, T. Hanyu, K. Vo, T. Pham, M. Tran, G. Doretto, A. Nguyen, and N. Le. Open-fusion: Real-time open-vocabulary 3d mapping and queryable scene representation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 9411–9417, 2024. doi: 10.1109/ICRA57147.2024.10610193

[54] Y. Yang, H. Yang, J. Zhou, P. Chen, H. Zhang, Y. Du, and C. Gan. 3d-mem: 3d scene memory for embodied exploration and reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17294–17303, June 2025.

[55] C. Zhang, D. Han, Y. Qiao, J. U. Kim, S.-H. Bae, S. Lee, and C. S. Hong. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289, 2023.

[56] Q. Zhang, X. Zhang, R. Zhu, F. Bai, M. Naserian, and Z. M. Mao. Robust real-time multi-vehicle collaboration on asynchronous sensors. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, ACM MobiCom ’23. Association for Computing Machinery, New York, NY, USA, 2023. doi: 10.1145/3570361.3613271

[57] W. Zhang, Z. He, L. Liu, Z. Jia, Y. Liu, M. Gruteser, D. Raychaudhuri, and Y. Zhang. Elf: accelerate high-resolution mobile deep vision with content-aware parallel offloading. In Proceedings of the 27th Annual International Conference on Mobile Computing and Networking, MobiCom ’21, p. 201–214. Association for Computing Machinery, New York, NY, USA, ... doi: 10.1145/3447993.3448628

[58] Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu, et al. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023.

[59] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16793–16803, 2022.

[60] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee. Segment everything everywhere all at once. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23. Curran Associates Inc., Red Hook, NY, USA, 2024.