pith. machine review for the scientific record.

arxiv: 2505.23747 · v1 · submitted 2025-05-29 · 💻 cs.CV · cs.AI · cs.LG


Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords multimodal large language models · spatial reasoning · visual geometry · dual encoder · 2D to 3D inference · frame sampling · supervised fine-tuning

The pith

Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This work aims to enhance multimodal large language models so they can understand and reason about three-dimensional space when given only ordinary two-dimensional images or videos. The central idea is to add a second visual encoder, initialized from a geometry-focused model, to capture structural information alongside the usual semantic features. By fusing these streams in a connector and sampling informative frames from videos, the model trains on a custom dataset to handle tasks like judging distances, relations between objects, and scene layouts. A sympathetic reader would care because many practical applications, from robot navigation to photo editing, need spatial awareness, yet most available data lacks explicit 3D labels. If successful, it removes the requirement for costly three-dimensional sensors or datasets in building capable vision-language systems.

Core claim

The paper claims that initializing a spatial encoder from the backbone of a feed-forward visual geometry foundation model, pairing it with a standard 2D semantic encoder, and integrating their outputs through a connector produces unified visual tokens that enable superior spatial understanding and reasoning from purely 2D inputs. The model is trained on the Spatial-MLLM-120k dataset with supervised fine-tuning and GRPO (Group Relative Policy Optimization), and is further aided at inference time by space-aware frame sampling.
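
The GRPO component of that claim boils down to reward normalization within a group of sampled responses. A minimal sketch of that normalization, not the paper's training code (the clipping and KL terms of the full objective are omitted):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: rewards for a group of
    sampled responses to one prompt are normalized by the group's own
    mean and standard deviation, so no learned value model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Above-mean rewards get positive advantages, below-mean get negative.
adv = grpo_advantages([2.0, 1.0, 0.0, 1.0])
```

Because the baseline is the group mean, only relative reward differences within a prompt's samples drive the policy update.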

What carries the argument

Dual-encoder architecture with a semantic 2D visual encoder and a spatial encoder derived from a visual geometry foundation model.
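
A minimal sketch of how such a dual-encoder connector could combine the two feature streams. The purely linear fusion rule, the function names, and all widths are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fuse_visual_tokens(sem_feats, spa_feats, w_sem, w_spa, w_mix):
    """Project each stream to a shared width, concatenate, and mix into
    unified visual tokens. A linear stand-in for the connector."""
    s = sem_feats @ w_sem                    # (N, d) projected semantic features
    g = spa_feats @ w_spa                    # (N, d) projected spatial features
    return np.concatenate([s, g], axis=-1) @ w_mix  # (N, d) unified tokens

rng = np.random.default_rng(0)
N, d_sem, d_spa, d = 16, 1024, 768, 256      # illustrative sizes only
tokens = fuse_visual_tokens(
    rng.standard_normal((N, d_sem)), rng.standard_normal((N, d_spa)),
    rng.standard_normal((d_sem, d)), rng.standard_normal((d_spa, d)),
    rng.standard_normal((2 * d, d)),
)
```

The key structural point survives the simplification: both encoders see the same 2D input, and the language model only ever receives the fused tokens.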

If this is right

  • Performance reaches state-of-the-art levels across multiple visual spatial understanding and reasoning benchmarks using only 2D data.
  • Video-based spatial tasks improve because the model selects frames that carry the most spatial information.
  • Training does not require any additional 3D or 2.5D data beyond the 2D inputs and the new 120k dataset.
  • The approach works for both image and video inputs without changes to the core model.
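
The frame-selection bullet above can be sketched as a simple top-k pick by a per-frame spatial-information score. The scores are assumed given here; producing them is the job of a scoring model this sketch does not cover, and the greedy rule is an editorial assumption:

```python
import numpy as np

def space_aware_sample(frame_scores, budget):
    """Keep the `budget` frames with the highest spatial-information
    scores, returned in temporal order so downstream video tokenization
    still sees a chronologically ordered sequence."""
    top = np.argsort(frame_scores)[::-1][:budget]  # indices of top-k scores
    return np.sort(top)                            # restore temporal order

# Two highest-scoring frames of a five-frame clip, kept in order.
selected = space_aware_sample(np.array([0.1, 0.9, 0.3, 0.8, 0.2]), budget=2)
```

Under a fixed token budget, this trades uniform temporal coverage for coverage of the frames that matter most for spatial questions.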

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models built this way might scale to longer videos or streaming inputs if the sampling strategy generalizes.
  • Similar initialization tricks could be tried on other geometry foundation models to boost different spatial capabilities.
  • Real-world deployment in mobile devices becomes more feasible since no extra hardware for depth sensing is needed.

Load-bearing premise

That the spatial encoder initialized from a visual geometry foundation model can reliably extract usable three-dimensional structure features directly from two-dimensional image or video inputs without any three-dimensional supervision during training.

What would settle it

Run the model and a standard CLIP-based MLLM on a held-out set of 2D images that depict clear 3D spatial relations. If Spatial-MLLM's accuracy on spatial questions merely matches or falls below the baseline's, the added encoder provides no benefit; a consistent, statistically significant gap in its favor would support the load-bearing premise.
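
Since both models answer the same held-out items, the comparison is paired, and a bootstrap interval on the accuracy gap is one hedged way to run it. `bootstrap_gap` and its inputs are illustrative names, not anything from the paper:

```python
import numpy as np

def bootstrap_gap(correct_a, correct_b, n_boot=10000, seed=0):
    """95% bootstrap interval for the accuracy gap between two models
    evaluated on the same spatial-QA items (paired design). Inputs are
    per-item 0/1 correctness vectors of equal length."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))  # resample items
    gaps = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return np.percentile(gaps, [2.5, 97.5])

# Toy correctness vectors; real runs would use benchmark outputs.
lo, hi = bootstrap_gap([1, 1, 0, 1] * 50, [1, 0, 0, 1] * 50)
```

An interval entirely above zero would favor the dual-encoder model; an interval containing zero would leave the premise unsettled.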

Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Spatial-MLLM, a dual-encoder MLLM for visual-based spatial reasoning from purely 2D image and video inputs. It uses a pretrained 2D visual encoder for semantic features and a spatial encoder initialized from the backbone of a feed-forward visual geometry foundation model to extract 3D structure features, fuses them via a connector, applies space-aware frame sampling for videos, constructs the Spatial-MLLM-120k dataset, and trains with supervised fine-tuning plus GRPO, claiming state-of-the-art performance across spatial understanding and reasoning benchmarks.

Significance. If the experimental results hold, the work offers a practical route to spatial intelligence in MLLMs without requiring 3D or 2.5D training data, which would expand applicability to standard image and video pipelines in robotics, AR/VR, and video analysis. The dataset release and GRPO training protocol constitute concrete, reusable contributions.

major comments (2)
  1. [Abstract, §4 (Experiments)] The central SOTA claim is asserted after training on the 120k dataset with SFT and GRPO, yet the abstract supplies no quantitative metrics, ablation tables, or error bars. The load-bearing performance numbers and comparisons to prior video MLLMs must be presented with concrete scores, baselines, and statistical details to substantiate the headline result.
  2. [§3.1 (Dual-encoder architecture)] The claim that the spatial encoder, initialized from the geometry-model backbone, reliably supplies usable 3D structure features from 2D inputs without any 3D supervision is central to the novelty. No ablation, feature visualization, or quantitative probe (e.g., geometric vs. semantic leakage) is described that isolates the contribution of these transferred features; if they collapse to semantic information already available from the CLIP-style encoder, the architecture reduces to a standard video MLLM and the reported gains are unexplained.
minor comments (2)
  1. [§3.2, Figure 2] The space-aware frame sampling strategy is introduced without pseudocode or an explicit selection criterion; adding a short algorithmic description would improve reproducibility.
  2. [References, §4.1] Several recent video MLLM baselines (e.g., Video-LLaVA, LLaVA-NeXT-Video) are cited, but the exact versions and training settings used for comparison should be stated explicitly in §4.1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of results and the validation of the dual-encoder design. All changes are highlighted in the revised version.

Point-by-point responses
  1. Referee: [Abstract, §4 (Experiments)] The central SOTA claim is asserted after training on the 120k dataset with SFT and GRPO, yet the abstract supplies no quantitative metrics, ablation tables, or error bars. The load-bearing performance numbers and comparisons to prior video MLLMs must be presented with concrete scores, baselines, and statistical details to substantiate the headline result.

    Authors: We agree that the abstract should contain the key quantitative results to support the SOTA claim. In the revised manuscript we have added the primary benchmark scores (e.g., average gains over prior video MLLMs on spatial understanding and reasoning tasks) together with the main baselines. In §4 we have augmented the tables with error bars computed over multiple runs and have made the comparisons to prior methods more explicit by listing exact scores and dataset splits. These additions directly address the request for concrete metrics and statistical detail. revision: yes

  2. Referee: [§3.1 (Dual-encoder architecture)] The claim that the spatial encoder, initialized from the geometry-model backbone, reliably supplies usable 3D structure features from 2D inputs without any 3D supervision is central to the novelty. No ablation, feature visualization, or quantitative probe (e.g., geometric vs. semantic leakage) is described that isolates the contribution of these transferred features; if they collapse to semantic information already available from the CLIP-style encoder, the architecture reduces to a standard video MLLM and the reported gains are unexplained.

    Authors: We acknowledge that an explicit isolation of the spatial encoder’s contribution strengthens the novelty argument. The original submission already contained an ablation in §4.3 that removed the spatial encoder and reported the resulting performance drop. To further address the referee’s concern we have added (i) t-SNE visualizations of features from both encoders on the same inputs and (ii) a quantitative probe that measures geometric consistency (e.g., depth and pose estimation accuracy) when using only the spatial encoder versus only the semantic encoder. These new analyses demonstrate that the spatial encoder supplies distinct 3D-structure information that is not redundant with the CLIP-style encoder, thereby explaining the observed gains. revision: yes
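
One form the redundancy probe mentioned in this response could take is linear Centered Kernel Alignment (CKA) between the two encoders' features on matched inputs. This is an editorial illustration, not the authors' actual probe:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices of
    shape (n_samples, n_features). Values near 1 mean the representations
    are largely redundant; a low value on matched inputs would support
    the claim that the spatial encoder adds distinct 3D-structure
    information beyond the semantic stream."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 8))
same = linear_cka(feats, feats)  # identical features give CKA close to 1.0
```

CKA is invariant to orthogonal transforms and isotropic scaling of either feature space, which makes it a reasonable, if coarse, redundancy measure across encoders of different widths.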

Circularity Check

0 steps flagged

No circularity: empirical architecture with external evaluation

Full rationale

The paper contains no equations, derivations, or predictions that reduce to inputs by construction. It describes a dual-encoder architecture (semantic encoder + spatial encoder initialized from a pre-trained visual geometry model), a space-aware frame sampling strategy, and training on the constructed Spatial-MLLM-120k dataset using SFT and GRPO. Performance is measured on external real-world datasets. No self-citation chains, fitted parameters renamed as predictions, or self-definitional steps are present. The central SOTA claim rests on experimental results rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical machine-learning paper. No new physical entities or ad-hoc axioms are introduced. The work relies on standard transfer-learning assumptions about pretrained encoders and the utility of the new dataset.

axioms (1)
  • domain assumption Pretrained visual geometry foundation models contain transferable 3D structural priors usable on 2D inputs
    The spatial encoder is initialized directly from such a backbone without further justification or ablation in the abstract.

pith-pipeline@v0.9.0 · 5609 in / 1242 out tokens · 52855 ms · 2026-05-16T08:31:09.946115+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  2. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  3. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  4. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  5. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  6. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.

  7. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  8. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  9. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  10. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  11. Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

  12. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  13. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  14. Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

    cs.AI 2026-02 unverdicted novelty 6.0

    MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.

  15. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  16. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV 2026-05 unverdicted novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

  17. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  18. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  19. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 18 Pith papers · 19 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,”NeurIPS, 2022

  2. [2]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML, 2023

  3. [3]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”NeurIPS, 2024

  4. [4]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

  5. [5]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

  6. [6]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu,et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” inCVPR, 2024

  8. [8]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, 2023

  9. [9]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,”arXiv preprint arXiv:2311.10122, 2023

  10. [10]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhao, et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

  11. [11]

    Streaming long video understanding with large language models,

    R. Qian, X. Dong, P. Zhang, Y . Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,”arXiv preprint arXiv:2405.16009, 2024

  12. [12]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,” ArXiv, vol. abs/2410.02713, 2024

  13. [13]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K.-Y . Chen, X. Liu, J. Wang, W. Ge, Y . Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”ArXiv, vol. abs/2409.12191, 2024

  14. [14]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”ArXiv, vol. abs/2502.13923, 2025

  15. [15]

    Audiogpt: Understanding and generating speech, music, sound, and talking head,

    R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y . Wu, Z. Hong, J.-B. Huang, J. Liu, Y . Ren, Z. Zhao, and S. Watanabe, “Audiogpt: Understanding and generating speech, music, sound, and talking head,” ArXiv, vol. abs/2304.12995, 2023

  16. [16]

    Salmonn: Towards generic hearing abilities for large language models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”ArXiv, vol. abs/2310.13289, 2023

  17. [17]

    Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment,

    Z. Liu, Y . Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y . Rao, “Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment,”ArXiv, vol. abs/2502.04328, 2025

  18. [18]

    Thinking in space: How multimodal large language models see, remember, and recall spaces,

    J. Yang, S. Yang, A. W. Gupta, R. Han, F.-F. Li, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,”ArXiv, vol. abs/2412.14171, 2024

  19. [19]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

    B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Driess, P. Florence, D. Sadigh, L. J. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14455–14465, 2024

  20. [21]

    3d-llava: Towards generalist 3d lmms with omni superpoint transformer,

    J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid, “3d-llava: Towards generalist 3d lmms with omni superpoint transformer,”ArXiv, vol. abs/2501.01163, 2025

  21. [22]

    Chat-scene: Bridging 3d scene and large language models with object identifiers,

    H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y . Zhao, T. Jin, and Z. Zhao, “Chat-scene: Bridging 3d scene and large language models with object identifiers,” inNeural Information Processing Systems, 2023

  22. [23]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26428–26438, 2024

  23. [24]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness,

    C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, “Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness,”arXiv preprint arXiv:2409.18125, 2024. 18

  24. [25]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding,

    D. Zheng, S. Huang, and L. Wang, “Video-3d llm: Learning position-aware video representation for 3d scene understanding,”ArXiv, vol. abs/2412.00493, 2024

  25. [26]

    Datacomp: In search of the next generation of multimodal datasets,

    S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V . Ramanujan, Y . Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. G. Dimakis, J. Jitsev,...

  26. [27]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning, 2021

  27. [28]

    Tulip: Towards unified language-image pretraining,

    Z. Tang, L. Lian, S. Eisape, X. Wang, R. Herzig, A. Yala, A. Suhr, T. Darrell, and D. M. Chan, “Tulip: Towards unified language-image pretraining,”ArXiv, vol. abs/2503.15485, 2025

  28. [29]

    Beyond semantics: Rediscovering spatial awareness in vision-language models,

    J. Qi, J. Liu, H. Tang, and Z. Zhu, “Beyond semantics: Rediscovering spatial awareness in vision-language models,”ArXiv, vol. abs/2503.17349, 2025

  29. [30]

    Long-clip: Unlocking the long-text capability of clip,

    B. Zhang, P. Zhang, X. wen Dong, Y . Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” inEuropean Conference on Computer Vision, 2024

  30. [31]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 20697–20709, 2023

  31. [32]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  32. [33]

    Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos,

    Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V . Ye, A. Kanazawa, A. Holynski, and N. Snavely, “Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos,”ArXiv, vol. abs/2412.04463, 2024

  33. [34]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J.-M. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B.-L. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D.-L. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H....

  34. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J.-M. Song, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” ArXiv, vol. abs/2402.03300, 2024

  35. [36]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, F. Xia, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,”ArXiv, vol. abs/2201.11903, 2022

  36. [37]

    Scanqa: 3d question answering for spatial scene understanding,

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19107–19117, 2021

  37. [38]

    Sqa3d: Situated question answering in 3d scenes,

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,”ArXiv, vol. abs/2210.07474, 2022

  38. [39]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024

  39. [40]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

  40. [41]

    PandaGPT: One Model To Instruction-Follow Them All

    Y . Su, T. Lan, H. Li, J. Xu, Y . Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355, 2023. 19

  [42]

    Detgpt: Detect what you need via reasoning

    R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. Kong, et al., “Detgpt: Detect what you need via reasoning,” arXiv preprint arXiv:2305.14167, 2023

  [43]

    VideoChat: Chat-Centric Video Understanding

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

  [44]

    Grounded 3d-llm with referent tokens

    Y. Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024

  [45]

    Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes

    Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,” arXiv preprint arXiv:2308.08769, 2023

  [47]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang, et al., “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  [48]

    3d-llm: Injecting the 3d world into large language models

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 20482–20494, 2023

  [50]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Z. Qi, Z. Zhang, Y. Fang, J. Wang, and H. Zhao, “Gpt4scene: Understand 3d scenes from videos with vision-language models,” arXiv preprint arXiv:2501.01428, 2025

  [51]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,” arXiv preprint arXiv:2409.12961, 2024

  [52]

    Videoagent: Long-form video understanding with large language model as agent

    X. Wang, Y. Zhang, O. Zohar, and S. Yeung-Levy, “Videoagent: Long-form video understanding with large language model as agent,” in European Conference on Computer Vision, pp. 58–76, Springer, 2024

  [53]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

    Y. Li, Y. Zhang, T. Lin, X. Liu, W. Cai, Z. Liu, and B. Zhao, “Sti-bench: Are mllms ready for precise spatial-temporal world understanding?,” arXiv preprint arXiv:2503.23765, 2025

  [54]

    St-think: How multimodal large language models reason about 4d worlds from ego-centric videos

    P. Wu, Y. Liu, M. Liu, and J. Shen, “St-think: How multimodal large language models reason about 4d worlds from ego-centric videos,” arXiv preprint arXiv:2503.12542, 2025

  [55]

    Vlm4d: Towards spatiotemporal awareness in vision language models

    S. Zhou, A. Vilesov, X. He, Z. Wan, S. Zhang, A. Nagachandra, D. Chang, D. Chen, X. E. Wang, and A. Kadambi, “Vlm4d: Towards spatiotemporal awareness in vision language models,”

  [56]

    Attention is all you need

    A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems, 2017

  [57]

    Vision Transformers Need Registers

    T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” ArXiv, vol. abs/2309.16588, 2023

  [58]

    Transfer between modalities with metaqueries

    X. Pan, S. N. Shukla, A. Singh, Z. Zhao, S. K. Mishra, J. Wang, Z. Xu, J. Chen, K. Li, F. Juefei-Xu, J. Hou, and S. Xie, “Transfer between modalities with metaqueries,” 2025

  [59]

    An analysis of approximations for maximizing submodular set functions—I

    G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions—I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978

  [60]

    D. S. Hochbaum, Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems, pp. 94–143. USA: PWS Publishing Co., 1996

  [61]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2443, 2017

  [62]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014

  60. [63]

    Longvila: Scaling long-context visual language models for long videos,

    F. Xue, Y . Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y . Fang, H. Tang, S. Yang, Z. Liu, E. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y . Zhu, Y . Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,”ArXiv, vol. abs/2408.10188, 2024

  61. [64]

    Vila: On pre-training for visual language models,

    J. Lin, H. Yin, W. Ping, Y . Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,”2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26679–26689, 2023

  [65]

    Long Context Transfer from Language to Vision

    P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,” ArXiv, vol. abs/2406.16852, 2024

  [66]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12–22, 2023

  64. [67]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,

    G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

  65. [68]

    An embodied generalist agent in 3d world,

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y . Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,”ArXiv, vol. abs/2311.12871, 2023

  66. [69]

    3d-vista: Pre-trained transformer for 3d vision and text alignment,

    Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,”2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2899–2909, 2023

  67. [70]

    3d-llm: Injecting the 3d world into large language models,

    Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,”NeurIPS, 2023

  68. [71]

    Open3D: A Modern Library for 3D Data Processing

    Q.-Y . Zhou, J. Park, and V . Koltun, “Open3d: A modern library for 3d data processing,” ArXiv, vol. abs/1801.09847, 2018

  69. [72]

    Indoor segmentation and support inference from rgbd images,

    N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” inEuropean Conference on Computer Vision, 2012

  70. [73]

    Perceptual organization and recognition of indoor scenes from rgb-d images,

    S. Gupta, P. Arbeláez, and J. Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,”2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 564–571, 2013

  71. [74]

    Scene-llm: Extending language model for 3d visual understanding and reasoning,

    R. Fu, J. Liu, X. Chen, Y . Nie, and W. Xiong, “Scene-llm: Extending language model for 3d visual understanding and reasoning,”ArXiv, vol. abs/2403.11401, 2024. 21