OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
Pith reviewed 2026-05-12 04:55 UTC · model grok-4.3
The pith
Fusing vision-language, textual, and geometric features lets a new framework align objects across partially overlapping 3D scene graphs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenSGA predicts object correspondences in 3D scene graphs by fusing vision-language, textual, and geometric features together with spatial context. A distance-gated spatial attention encoder processes the combined features, a minimum-cost-flow allocator assigns matches, and a global scene embedding generator provides additional consistency. The framework handles both frame-to-scan (F2S) and subscan-to-subscan (S2S) alignment under large coordinate discrepancies and open-set categories, and it is trained and tested on the newly introduced ScanNet-SG dataset of over 700k samples.
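The review does not reproduce the allocator's exact cost design. As a minimal sketch, assuming match costs are cosine distances between fused object embeddings, a partial one-to-one assignment can be posed as a min-cost-flow problem and solved with NetworkX's `min_cost_flow`; the `max_cost` skip price and integer scaling below are illustrative choices, not the authors'.

```python
import networkx as nx
import numpy as np

def allocate_matches(cost, max_cost=0.5):
    """Partial one-to-one assignment via minimum-cost flow.

    cost: (n_src, n_tgt) array of match costs, assumed here to be
    cosine distances between fused object embeddings. Pairs costlier
    than `max_cost` take a "skip" edge instead, so objects visible in
    only one graph stay unmatched. Threshold and scaling are
    illustrative, not the paper's actual design.
    """
    n_src, n_tgt = cost.shape
    scale = 1000  # the solver prefers integer edge weights
    G = nx.DiGraph()
    for i in range(n_src):
        G.add_edge("s", f"src{i}", capacity=1, weight=0)
        # Skip edge: leaving src i unmatched costs exactly max_cost.
        G.add_edge(f"src{i}", "t", capacity=1, weight=int(max_cost * scale))
        for j in range(n_tgt):
            if cost[i, j] < max_cost:
                G.add_edge(f"src{i}", f"tgt{j}", capacity=1,
                           weight=int(cost[i, j] * scale))
    for j in range(n_tgt):
        G.add_edge(f"tgt{j}", "t", capacity=1, weight=0)
    G.nodes["s"]["demand"] = -n_src  # one unit of flow per source object
    G.nodes["t"]["demand"] = n_src
    flow = nx.min_cost_flow(G)
    return [(i, j) for i in range(n_src) for j in range(n_tgt)
            if flow.get(f"src{i}", {}).get(f"tgt{j}", 0) == 1]
```

One unit of flow per source object, unit capacities into each target, and the skip edge together yield a partial one-to-one matching that is globally cost-optimal rather than greedy.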
What carries the argument
Multi-modal feature fusion inside a distance-gated spatial attention encoder followed by minimum-cost-flow allocation to establish correspondences.
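The gating function itself is not given here; one plausible reading, sketched below, multiplies standard attention weights by a learned sigmoid gate over pairwise object-center distances, so far-apart objects attend to each other less. `DistanceGatedAttention` and its parameters are hypothetical, not the authors' exact module.

```python
import torch
import torch.nn as nn

class DistanceGatedAttention(nn.Module):
    """Single-head attention over object nodes, gated by pairwise
    center distances. One plausible reading of "distance-gated
    spatial attention", not the paper's verbatim design: the gate
    sigmoid(a - b * d) decays with distance d, with a and b learned.
    """
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.a = nn.Parameter(torch.tensor(1.0))  # gate offset
        self.b = nn.Parameter(torch.tensor(0.5))  # distance sensitivity

    def forward(self, x, centers):
        # x: (N, dim) fused object features; centers: (N, 3) positions.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.T) / x.shape[-1] ** 0.5              # (N, N)
        gate = torch.sigmoid(self.a - self.b * torch.cdist(centers, centers))
        attn = torch.softmax(logits, dim=-1) * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        return self.out(attn @ v)
```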
If this is right
- Robots obtain reliable object-level relocalization when they revisit a location.
- Multiple agents can fuse their individual maps into a consistent global map at the object level.
- Scene understanding and mapping become feasible without limiting the robot to a fixed, closed set of object labels.
- Alignment remains possible even when the coordinate systems of the two observations are poorly aligned.
- Large-scale training and systematic evaluation become practical thanks to the released ScanNet-SG dataset.
Where Pith is reading between the lines
- The same fused representation could support incremental lifelong mapping by reusing prior alignments when a robot returns to an area.
- Adding temporal or motion features to the fusion step might extend the method to dynamic scenes with moving objects.
- The automated annotation pipeline that combines ScanNet labels with GPT-4o tagging could be reused to create open-world benchmarks for other 3D perception tasks.
Load-bearing premise
The combination of vision-language, textual, and geometric features, plus spatial context, is sufficient to find accurate object matches despite large coordinate discrepancies and open-set categories.
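As a concrete, assumed instance of this premise, a per-object descriptor could simply concatenate L2-normalized cues from each modality; the dimensions noted below are typical for CLIP ViT-B/32 and MiniLM-style sentence embeddings, and plain concatenation stands in for whatever learned fusion the paper actually uses.

```python
import numpy as np

def fuse_object_descriptor(clip_emb, text_emb, obb_extents, centroid):
    """Concatenate per-object cues into one descriptor.

    clip_emb:    (512,) vision-language embedding of the object crop
    text_emb:    (384,) sentence embedding of the open-set label
    obb_extents: (3,)   oriented-bounding-box side lengths
    centroid:    (3,)   object center, a simple spatial-context cue
    Per-part L2 normalization keeps modality scales comparable; this
    is an assumed stand-in for the paper's learned fusion.
    """
    parts = []
    for v in (clip_emb, text_emb, obb_extents, centroid):
        v = np.asarray(v, dtype=np.float32)
        n = np.linalg.norm(v)
        parts.append(v / n if n > 0 else v)
    return np.concatenate(parts)
```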
What would settle it
A controlled test set of scene pairs with substantially larger coordinate offsets, or with many unseen object categories, on which alignment precision falls below that of strong geometric baselines would falsify the claim.
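A minimal harness for such a test, independent of any particular method: perturb one copy of a scene with a large random rigid transform, run the aligner, and score predicted correspondences against ground truth. The angle and offset ranges below are illustrative, not taken from the paper.

```python
import numpy as np

def random_se3(max_angle_deg=180.0, max_offset=10.0, rng=None):
    """Sample a large SE(3) transform to stress coordinate discrepancy."""
    rng = rng or np.random.default_rng()
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.uniform(0.0, max_angle_deg))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    t = rng.uniform(-max_offset, max_offset, size=3)
    return R, t

def match_precision(predicted, ground_truth):
    """Fraction of predicted (src, tgt) pairs that are correct."""
    gt = set(map(tuple, ground_truth))
    return sum(p in gt for p in map(tuple, predicted)) / max(len(predicted), 1)
```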
Original abstract
Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents OpenSGA, a unified framework for 3D scene graph alignment that fuses vision-language, textual, and geometric features with spatial context via a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator. It targets both frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks in open-world settings with large coordinate discrepancies. The authors introduce ScanNet-SG, a large-scale dataset of over 700k samples constructed via an automated pipeline that augments 509 ScanNet labels with GPT-4o tagging for more than 3k categories. The authors report that OpenSGA achieves the best overall performance on both tasks in their experiments, substantially outperforming prior scene graph alignment methods. Code and dataset are released.
Significance. If the performance claims are supported by validated data, the work advances open-set 3D scene understanding for robotics by enabling F2S alignment and large-scale training on diverse categories. The explicit release of code and the ScanNet-SG dataset constitutes a clear strength for reproducibility and follow-on research.
major comments (2)
- [Dataset Construction] Dataset Construction section: the ScanNet-SG dataset is generated using GPT-4o tagging for >3k open-set categories to produce >700k samples and correspondence labels, yet no quantitative validation (precision/recall, human agreement, or error rates on object categories) is reported. Because tagging mismatches would directly corrupt positive/negative pair labels and spatial-context features, it is impossible to isolate whether the reported F2S/S2S gains arise from the fused modules or from dataset artifacts.
- [Experiments] Experiments section: the headline claim of best overall F2S and S2S performance rests on comparisons against prior methods on ScanNet-SG, but the manuscript provides no ablation studies isolating the contribution of the distance-gated attention or feature-fusion components, no error bars, and no failure-case analysis under large coordinate discrepancies. These omissions make it impossible to verify robustness of the central empirical result.
minor comments (2)
- [Abstract] Abstract: reports superior performance without any quantitative metrics, table references, or result summaries; adding a compact results table would improve clarity.
- [Notation and Presentation] Ensure all acronyms (F2S, S2S, VL) are defined on first use and used consistently in figure captions and tables.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Dataset Construction] Dataset Construction section: the ScanNet-SG dataset is generated using GPT-4o tagging for >3k open-set categories to produce >700k samples and correspondence labels, yet no quantitative validation (precision/recall, human agreement, or error rates on object categories) is reported. Because tagging mismatches would directly corrupt positive/negative pair labels and spatial-context features, it is impossible to isolate whether the reported F2S/S2S gains arise from the fused modules or from dataset artifacts.
Authors: We agree that explicit quantitative validation of the GPT-4o tagging step is important to rule out label noise as a confounding factor. The current manuscript describes the automated pipeline but does not report precision/recall or human agreement metrics. In the revised version we will add a dedicated validation subsection that includes (i) precision/recall on a manually inspected random subset of 5,000 samples and (ii) inter-rater agreement statistics with two human annotators on the same subset (a sketch of these metrics follows the responses below). These additions will allow readers to assess label quality independently of the alignment modules. revision: yes
Referee: [Experiments] Experiments section: the headline claim of best overall F2S and S2S performance rests on comparisons against prior methods on ScanNet-SG, but the manuscript provides no ablation studies isolating the contribution of the distance-gated attention or feature-fusion components, no error bars, and no failure-case analysis under large coordinate discrepancies. These omissions make it impossible to verify robustness of the central empirical result.
Authors: We concur that the empirical section would benefit from component-wise ablations, statistical variability measures, and targeted failure analysis. The present manuscript reports only aggregate performance numbers. We will revise the Experiments section to include: (i) ablations that isolate the distance-gated spatial attention encoder and the vision-language/geometric feature fusion, (ii) mean and standard deviation results over five independent training runs with different random seeds (see the aggregation sketch below), and (iii) a qualitative and quantitative failure-case study focused on pairs with large coordinate discrepancies. These changes will directly address concerns about robustness. revision: yes
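A minimal sketch of the validation metrics promised in the first response, assuming a one-vs-rest protocol per category and scikit-learn's standard implementations; `validate_tags` and its inputs are hypothetical names.

```python
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score

def validate_tags(auto_labels, human_a, human_b, category):
    """Tag quality on a manually inspected subset, one category at a
    time. `auto_labels` are GPT-4o tags; `human_a` / `human_b` are the
    two annotators' labels on the same samples. The one-vs-rest
    framing is an assumed protocol, not the authors' stated one.
    """
    y_pred = [l == category for l in auto_labels]
    y_true = [l == category for l in human_a]  # annotator A as reference
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        # Inter-rater agreement between the two human annotators.
        "kappa": cohen_kappa_score(human_a, human_b),
    }
```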
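For the second response, the promised variability numbers reduce to a mean and sample standard deviation per configuration over the seeded runs; the configuration names below are placeholders.

```python
import numpy as np

def aggregate_over_seeds(results):
    """results maps a config name to per-seed scores, e.g.
    {"full": [...], "no_distance_gate": [...], "no_vl_fusion": [...]}.
    Returns (mean, sample std) per configuration; needs >= 2 seeds.
    """
    return {name: (float(np.mean(s)), float(np.std(s, ddof=1)))
            for name, s in results.items()}
```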
Circularity Check
No significant circularity in derivation or claims
Full rationale
The paper's central claims concern empirical performance of a scene-graph alignment framework on F2S/S2S tasks, supported by experiments on a newly introduced dataset. No equations, predictions, or first-principles derivations are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The method description (distance-gated attention, minimum-cost-flow allocator, feature fusion) is algorithmic and evaluated externally against prior methods; dataset construction details do not alter the non-circular status of the reported results under the specified patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Vision-language and geometric features extracted from partially overlapping 3D observations contain sufficient signal for object correspondence.