pith. machine review for the scientific record.

arxiv: 2505.23747 · v1 · submitted 2025-05-29 · 💻 cs.CV · cs.AI · cs.LG


Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords multimodal large language models · spatial reasoning · visual geometry · dual encoder · 2D to 3D inference · frame sampling · supervised fine-tuning

The pith

Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This work aims to enhance multimodal large language models so they can understand and reason about three-dimensional space when given only ordinary two-dimensional images or videos. The central idea is to add a second visual encoder, initialized from a geometry-focused model, to capture structural information alongside the usual semantic features. By fusing these streams in a connector and sampling informative frames from videos, the model trains on a custom dataset to handle tasks like judging distances, relations between objects, and scene layouts. A sympathetic reader would care because many practical applications, from robot navigation to photo editing, need spatial awareness, yet most available data lacks explicit 3D labels. If successful, it removes the requirement for costly three-dimensional sensors or datasets in building capable vision-language systems.

Core claim

The paper claims that initializing a spatial encoder from the backbone of a feed-forward visual geometry foundation model, pairing it with a standard 2D semantic encoder, and integrating their outputs through a connector produces unified visual tokens that enable superior spatial understanding and reasoning from purely 2D inputs. The model is trained on the Spatial-MLLM-120k dataset with supervised fine-tuning and GRPO (Group Relative Policy Optimization), and is further aided at inference time by space-aware frame sampling.
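
The GRPO component of that claim boils down to reward normalization within a group of sampled responses. A minimal sketch of that normalization, not the paper's training code (the clipping and KL terms of the full objective are omitted):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: rewards for a group of
    sampled responses to one prompt are normalized by the group's own
    mean and standard deviation, so no learned value model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Above-mean rewards get positive advantages, below-mean get negative.
adv = grpo_advantages([2.0, 1.0, 0.0, 1.0])
```

Because the baseline is the group mean, only relative reward differences within a prompt's samples drive the policy update.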

What carries the argument

Dual-encoder architecture with a semantic 2D visual encoder and a spatial encoder derived from a visual geometry foundation model.
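
A minimal sketch of how such a dual-encoder connector could combine the two feature streams. The purely linear fusion rule, the function names, and all widths are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fuse_visual_tokens(sem_feats, spa_feats, w_sem, w_spa, w_mix):
    """Project each stream to a shared width, concatenate, and mix into
    unified visual tokens. A linear stand-in for the connector."""
    s = sem_feats @ w_sem                    # (N, d) projected semantic features
    g = spa_feats @ w_spa                    # (N, d) projected spatial features
    return np.concatenate([s, g], axis=-1) @ w_mix  # (N, d) unified tokens

rng = np.random.default_rng(0)
N, d_sem, d_spa, d = 16, 1024, 768, 256      # illustrative sizes only
tokens = fuse_visual_tokens(
    rng.standard_normal((N, d_sem)), rng.standard_normal((N, d_spa)),
    rng.standard_normal((d_sem, d)), rng.standard_normal((d_spa, d)),
    rng.standard_normal((2 * d, d)),
)
```

The key structural point survives the simplification: both encoders see the same 2D input, and the language model only ever receives the fused tokens.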

If this is right

  • Performance reaches state-of-the-art levels across multiple visual spatial understanding and reasoning benchmarks using only 2D data.
  • Video-based spatial tasks improve because the model selects frames that carry the most spatial information.
  • Training does not require any additional 3D or 2.5D data beyond the 2D inputs and the new 120k dataset.
  • The approach works for both image and video inputs without changes to the core model.
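
The frame-selection bullet above can be sketched as a simple top-k pick by a per-frame spatial-information score. The scores are assumed given here; producing them is the job of a scoring model this sketch does not cover, and the greedy rule is an editorial assumption:

```python
import numpy as np

def space_aware_sample(frame_scores, budget):
    """Keep the `budget` frames with the highest spatial-information
    scores, returned in temporal order so downstream video tokenization
    still sees a chronologically ordered sequence."""
    top = np.argsort(frame_scores)[::-1][:budget]  # indices of top-k scores
    return np.sort(top)                            # restore temporal order

# Two highest-scoring frames of a five-frame clip, kept in order.
selected = space_aware_sample(np.array([0.1, 0.9, 0.3, 0.8, 0.2]), budget=2)
```

Under a fixed token budget, this trades uniform temporal coverage for coverage of the frames that matter most for spatial questions.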

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models built this way might scale to longer videos or streaming inputs if the sampling strategy generalizes.
  • Similar initialization tricks could be tried on other geometry foundation models to boost different spatial capabilities.
  • Real-world deployment in mobile devices becomes more feasible since no extra hardware for depth sensing is needed.

Load-bearing premise

That the spatial encoder initialized from a visual geometry foundation model can reliably extract usable three-dimensional structure features directly from two-dimensional image or video inputs without any three-dimensional supervision during training.

What would settle it

Run the model and a standard CLIP-based MLLM on a held-out set of 2D images that depict clear 3D spatial relations. If Spatial-MLLM's accuracy on spatial questions merely matches or falls below the baseline's, the added encoder provides no benefit; a consistent, statistically significant gap in its favor would support the load-bearing premise.
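
Since both models answer the same held-out items, the comparison is paired, and a bootstrap interval on the accuracy gap is one hedged way to run it. `bootstrap_gap` and its inputs are illustrative names, not anything from the paper:

```python
import numpy as np

def bootstrap_gap(correct_a, correct_b, n_boot=10000, seed=0):
    """95% bootstrap interval for the accuracy gap between two models
    evaluated on the same spatial-QA items (paired design). Inputs are
    per-item 0/1 correctness vectors of equal length."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))  # resample items
    gaps = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return np.percentile(gaps, [2.5, 97.5])

# Toy correctness vectors; real runs would use benchmark outputs.
lo, hi = bootstrap_gap([1, 1, 0, 1] * 50, [1, 0, 0, 1] * 50)
```

An interval entirely above zero would favor the dual-encoder model; an interval containing zero would leave the premise unsettled.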

Original abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Spatial-MLLM, a dual-encoder MLLM for visual-based spatial reasoning from purely 2D image and video inputs. It uses a pretrained 2D visual encoder for semantic features and a spatial encoder initialized from the backbone of a feed-forward visual geometry foundation model to extract 3D structure features, fuses them via a connector, applies space-aware frame sampling for videos, constructs the Spatial-MLLM-120k dataset, and trains with supervised fine-tuning plus GRPO, claiming state-of-the-art performance across spatial understanding and reasoning benchmarks.

Significance. If the experimental results hold, the work offers a practical route to spatial intelligence in MLLMs without requiring 3D or 2.5D training data, which would expand applicability to standard image and video pipelines in robotics, AR/VR, and video analysis. The dataset release and GRPO training protocol constitute concrete, reusable contributions.

major comments (2)
  1. [Abstract, §4 (Experiments)] The central SOTA claim is asserted after training on the 120k dataset with SFT and GRPO, yet the abstract supplies no quantitative metrics, ablation tables, or error bars. The load-bearing performance numbers and comparisons to prior video MLLMs must be presented with concrete scores, baselines, and statistical details to substantiate the headline result.
  2. [§3.1 (Dual-encoder architecture)] The claim that the spatial encoder, initialized from the geometry-model backbone, reliably supplies usable 3D structure features from 2D inputs without any 3D supervision is central to the novelty. No ablation, feature visualization, or quantitative probe (e.g., geometric vs. semantic leakage) is described that isolates the contribution of these transferred features; if they collapse to semantic information already available from the CLIP-style encoder, the architecture reduces to a standard video MLLM and the reported gains are unexplained.
minor comments (2)
  1. [§3.2, Figure 2] The space-aware frame sampling strategy is introduced without pseudocode or an explicit selection criterion; adding a short algorithmic description would improve reproducibility.
  2. [References, §4.1] Several recent video MLLM baselines (e.g., Video-LLaVA, LLaVA-NeXT-Video) are cited, but the exact versions and training settings used for comparison should be stated explicitly in §4.1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of results and the validation of the dual-encoder design. All changes are highlighted in the revised version.

Point-by-point responses
  1. Referee: [Abstract, §4 (Experiments)] The central SOTA claim is asserted after training on the 120k dataset with SFT and GRPO, yet the abstract supplies no quantitative metrics, ablation tables, or error bars. The load-bearing performance numbers and comparisons to prior video MLLMs must be presented with concrete scores, baselines, and statistical details to substantiate the headline result.

    Authors: We agree that the abstract should contain the key quantitative results to support the SOTA claim. In the revised manuscript we have added the primary benchmark scores (e.g., average gains over prior video MLLMs on spatial understanding and reasoning tasks) together with the main baselines. In §4 we have augmented the tables with error bars computed over multiple runs and have made the comparisons to prior methods more explicit by listing exact scores and dataset splits. These additions directly address the request for concrete metrics and statistical detail. revision: yes

  2. Referee: [§3.1 (Dual-encoder architecture)] The claim that the spatial encoder, initialized from the geometry-model backbone, reliably supplies usable 3D structure features from 2D inputs without any 3D supervision is central to the novelty. No ablation, feature visualization, or quantitative probe (e.g., geometric vs. semantic leakage) is described that isolates the contribution of these transferred features; if they collapse to semantic information already available from the CLIP-style encoder, the architecture reduces to a standard video MLLM and the reported gains are unexplained.

    Authors: We acknowledge that an explicit isolation of the spatial encoder’s contribution strengthens the novelty argument. The original submission already contained an ablation in §4.3 that removed the spatial encoder and reported the resulting performance drop. To further address the referee’s concern we have added (i) t-SNE visualizations of features from both encoders on the same inputs and (ii) a quantitative probe that measures geometric consistency (e.g., depth and pose estimation accuracy) when using only the spatial encoder versus only the semantic encoder. These new analyses demonstrate that the spatial encoder supplies distinct 3D-structure information that is not redundant with the CLIP-style encoder, thereby explaining the observed gains. revision: yes
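
One form the redundancy probe mentioned in this response could take is linear Centered Kernel Alignment (CKA) between the two encoders' features on matched inputs. This is an editorial illustration, not the authors' actual probe:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices of
    shape (n_samples, n_features). Values near 1 mean the representations
    are largely redundant; a low value on matched inputs would support
    the claim that the spatial encoder adds distinct 3D-structure
    information beyond the semantic stream."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 8))
same = linear_cka(feats, feats)  # identical features give CKA close to 1.0
```

CKA is invariant to orthogonal transforms and isotropic scaling of either feature space, which makes it a reasonable, if coarse, redundancy measure across encoders of different widths.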

Circularity Check

0 steps flagged

No circularity: empirical architecture with external evaluation

Full rationale

The paper contains no equations, derivations, or predictions that reduce to inputs by construction. It describes a dual-encoder architecture (semantic encoder + spatial encoder initialized from a pre-trained visual geometry model), a space-aware frame sampling strategy, and training on the constructed Spatial-MLLM-120k dataset using SFT and GRPO. Performance is measured on external real-world datasets. No self-citation chains, fitted parameters renamed as predictions, or self-definitional steps are present. The central SOTA claim rests on experimental results rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical machine-learning paper. No new physical entities or ad-hoc axioms are introduced. The work relies on standard transfer-learning assumptions about pretrained encoders and the utility of the new dataset.

axioms (1)
  • domain assumption Pretrained visual geometry foundation models contain transferable 3D structural priors usable on 2D inputs
    The spatial encoder is initialized directly from such a backbone without further justification or ablation in the abstract.

pith-pipeline@v0.9.0 · 5609 in / 1242 out tokens · 52855 ms · 2026-05-16T08:31:09.946115+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem:

    "dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  2. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  3. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  4. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  5. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  6. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.

  7. SCP: Spatial Causal Prediction in Video

    cs.CV 2026-03 unverdicted novelty 7.0

    SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

  8. SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images

    cs.CV 2026-05 unverdicted novelty 6.0

    SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.

  9. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  10. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  11. Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.

  12. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  13. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  14. Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

    cs.AI 2026-02 unverdicted novelty 6.0

    MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.

  15. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  16. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV 2026-05 unverdicted novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

  17. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.

  18. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  19. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 18 Pith papers · 19 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,”NeurIPS, 2022

  2. [2]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML, 2023

  3. [3]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”NeurIPS, 2024

  4. [4]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

  5. [5]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

  6. [6]

    LLaVA-OneVision: Easy Visual Task Transfer

    B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu,et al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” inCVPR, 2024

  8. [8]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier large vision-language model with versatile abilities,”arXiv preprint arXiv:2308.12966, 2023

  9. [9]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,”arXiv preprint arXiv:2311.10122, 2023

  10. [10]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Z. Cheng, S. Leng, H. Zhang, Y . Xin, X. Li, G. Chen, Y . Zhu, W. Zhang, Z. Luo, D. Zhao, et al., “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms,” arXiv preprint arXiv:2406.07476, 2024

  11. [11]

    Streaming long video understanding with large language models,

    R. Qian, X. Dong, P. Zhang, Y . Zang, S. Ding, D. Lin, and J. Wang, “Streaming long video understanding with large language models,”arXiv preprint arXiv:2405.16009, 2024

  12. [12]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,” ArXiv, vol. abs/2410.02713, 2024

  13. [13]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K.-Y . Chen, X. Liu, J. Wang, W. Ge, Y . Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”ArXiv, vol. abs/2409.12191, 2024

  14. [14]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”ArXiv, vol. abs/2502.13923, 2025

  15. [15]

    Audiogpt: Understanding and generating speech, music, sound, and talking head,

    R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y . Wu, Z. Hong, J.-B. Huang, J. Liu, Y . Ren, Z. Zhao, and S. Watanabe, “Audiogpt: Understanding and generating speech, music, sound, and talking head,” ArXiv, vol. abs/2304.12995, 2023

  16. [16]

    Salmonn: Towards generic hearing abilities for large language models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”ArXiv, vol. abs/2310.13289, 2023

  17. [17]

    Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment,

    Z. Liu, Y . Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y . Rao, “Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment,”ArXiv, vol. abs/2502.04328, 2025

  18. [18]

    Thinking in space: How multimodal large language models see, remember, and recall spaces,

    J. Yang, S. Yang, A. W. Gupta, R. Han, F.-F. Li, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,”ArXiv, vol. abs/2412.14171, 2024

  19. [19]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

    B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Driess, P. Florence, D. Sadigh, L. J. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14455–14465, 2024

  20. [21]

    3d-llava: Towards generalist 3d lmms with omni superpoint transformer,

    J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid, “3d-llava: Towards generalist 3d lmms with omni superpoint transformer,”ArXiv, vol. abs/2501.01163, 2025

  21. [22]

    Chat-scene: Bridging 3d scene and large language models with object identifiers,

    H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y . Zhao, T. Jin, and Z. Zhao, “Chat-scene: Bridging 3d scene and large language models with object identifiers,” inNeural Information Processing Systems, 2023

  22. [23]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,

    S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, “Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26428–26438, 2024

  23. [24]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness,

    C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, “Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness,”arXiv preprint arXiv:2409.18125, 2024. 18

  24. [25]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding,

    D. Zheng, S. Huang, and L. Wang, “Video-3d llm: Learning position-aware video representation for 3d scene understanding,”ArXiv, vol. abs/2412.00493, 2024

  25. [26]

    Datacomp: In search of the next generation of multimodal datasets,

    S. Y . Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V . Ramanujan, Y . Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. G. Dimakis, J. Jitsev,...

  26. [27]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning, 2021

  27. [28]

    Tulip: Towards unified language-image pretraining,

    Z. Tang, L. Lian, S. Eisape, X. Wang, R. Herzig, A. Yala, A. Suhr, T. Darrell, and D. M. Chan, “Tulip: Towards unified language-image pretraining,”ArXiv, vol. abs/2503.15485, 2025

  28. [29]

    Beyond semantics: Rediscovering spatial awareness in vision-language models,

    J. Qi, J. Liu, H. Tang, and Z. Zhu, “Beyond semantics: Rediscovering spatial awareness in vision-language models,”ArXiv, vol. abs/2503.17349, 2025

  29. [30]

    Long-clip: Unlocking the long-text capability of clip,

    B. Zhang, P. Zhang, X. wen Dong, Y . Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,” inEuropean Conference on Computer Vision, 2024

  30. [31]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 20697–20709, 2023

  31. [32]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  32. [33]

    Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos,

    Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V . Ye, A. Kanazawa, A. Holynski, and N. Snavely, “Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos,”ArXiv, vol. abs/2412.04463, 2024

  33. [34]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J.-M. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B.-L. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D.-L. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H....

  34. [35]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J.-M. Song, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” ArXiv, vol. abs/2402.03300, 2024

  35. [36]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, F. Xia, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,”ArXiv, vol. abs/2201.11903, 2022

  36. [37]

    Scanqa: 3d question answering for spatial scene understanding,

    D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19107–19117, 2021

  37. [38]

    Sqa3d: Situated question answering in 3d scenes,

    X. Ma, S. Yong, Z. Zheng, Q. Li, Y . Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,”ArXiv, vol. abs/2210.07474, 2022

  38. [39]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024

  39. [40]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023

  40. [41]

    PandaGPT: One Model To Instruction-Follow Them All

    Y . Su, T. Lan, H. Li, J. Xu, Y . Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355, 2023. 19

  [42]

    Detgpt: Detect what you need via reasoning

    R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. Kong, et al., “Detgpt: Detect what you need via reasoning,” arXiv preprint arXiv:2305.14167, 2023

  [43]

    VideoChat: Chat-Centric Video Understanding

    K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, “Videochat: Chat-centric video understanding,” arXiv preprint arXiv:2305.06355, 2023

  [44]

    Grounded 3d-llm with referent tokens

    Y. Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang, “Grounded 3d-llm with referent tokens,” arXiv preprint arXiv:2405.10370, 2024

  [45]

    Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes

    Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, “Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,” arXiv preprint arXiv:2308.08769, 2023

  [47]

    Chat-scene: Bridging 3d scene and large language models with object identifiers

    H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang, et al., “Chat-scene: Bridging 3d scene and large language models with object identifiers,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  [48]

    3d-llm: Injecting the 3d world into large language models

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 20482–20494, 2023

  [50]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Z. Qi, Z. Zhang, Y. Fang, J. Wang, and H. Zhao, “Gpt4scene: Understand 3d scenes from videos with vision-language models,” arXiv preprint arXiv:2501.01428, 2025

  [51]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, “Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution,” arXiv preprint arXiv:2409.12961, 2024

  [52]

    Videoagent: Long-form video understanding with large language model as agent

    X. Wang, Y. Zhang, O. Zohar, and S. Yeung-Levy, “Videoagent: Long-form video understanding with large language model as agent,” in European Conference on Computer Vision, pp. 58–76, Springer, 2024

  [53]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

    Y. Li, Y. Zhang, T. Lin, X. Liu, W. Cai, Z. Liu, and B. Zhao, “Sti-bench: Are mllms ready for precise spatial-temporal world understanding?,” arXiv preprint arXiv:2503.23765, 2025

  [54]

    St-think: How multimodal large language models reason about 4d worlds from ego-centric videos

    P. Wu, Y. Liu, M. Liu, and J. Shen, “St-think: How multimodal large language models reason about 4d worlds from ego-centric videos,” arXiv preprint arXiv:2503.12542, 2025

  [55]

    Vlm4d: Towards spatiotemporal awareness in vision language models

    S. Zhou, A. Vilesov, X. He, Z. Wan, S. Zhang, A. Nagachandra, D. Chang, D. Chen, X. E. Wang, and A. Kadambi, “Vlm4d: Towards spatiotemporal awareness in vision language models,”

  [56]

    Attention is all you need

    A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems, 2017

  [57]

    Vision Transformers Need Registers

    T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” ArXiv, vol. abs/2309.16588, 2023

  [58]

    Transfer between modalities with metaqueries

    X. Pan, S. N. Shukla, A. Singh, Z. Zhao, S. K. Mishra, J. Wang, Z. Xu, J. Chen, K. Li, F. Juefei-Xu, J. Hou, and S. Xie, “Transfer between modalities with metaqueries,” 2025

  [59]

    An analysis of approximations for maximizing submodular set functions—I

    G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions—I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978

  [60]

    D. S. Hochbaum, Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems, pp. 94–143. USA: PWS Publishing Co., 1996

  [61]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2443, 2017

  [62]

    Adam: A Method for Stochastic Optimization

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014

  60. [63]

    Longvila: Scaling long-context visual language models for long videos,

    F. Xue, Y . Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y . Fang, H. Tang, S. Yang, Z. Liu, E. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y . Zhu, Y . Lu, and S. Han, “Longvila: Scaling long-context visual language models for long videos,”ArXiv, vol. abs/2408.10188, 2024

  61. [64]

    Vila: On pre-training for visual language models,

    J. Lin, H. Yin, W. Ping, Y . Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han, “Vila: On pre-training for visual language models,”2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26679–26689, 2023

  [65]

    Long Context Transfer from Language to Vision

    P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, “Long context transfer from language to vision,” ArXiv, vol. abs/2406.16852, 2024

  [66]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12–22, 2023

  64. [67]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,

    G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021

  65. [68]

    An embodied generalist agent in 3d world,

    J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y . Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, “An embodied generalist agent in 3d world,”ArXiv, vol. abs/2311.12871, 2023

  66. [69]

    3d-vista: Pre-trained transformer for 3d vision and text alignment,

    Z. Zhu, X. Ma, Y . Chen, Z. Deng, S. Huang, and Q. Li, “3d-vista: Pre-trained transformer for 3d vision and text alignment,”2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2899–2909, 2023

  67. [70]

    3d-llm: Injecting the 3d world into large language models,

    Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,”NeurIPS, 2023

  68. [71]

    Open3D: A Modern Library for 3D Data Processing

    Q.-Y . Zhou, J. Park, and V . Koltun, “Open3d: A modern library for 3d data processing,” ArXiv, vol. abs/1801.09847, 2018

  69. [72]

    Indoor segmentation and support inference from rgbd images,

    N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” inEuropean Conference on Computer Vision, 2012

  70. [73]

    Perceptual organization and recognition of indoor scenes from rgb-d images,

    S. Gupta, P. Arbeláez, and J. Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,”2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 564–571, 2013

  71. [74]

    Scene-llm: Extending language model for 3d visual understanding and reasoning,

    R. Fu, J. Liu, X. Chen, Y . Nie, and W. Xiong, “Scene-llm: Extending language model for 3d visual understanding and reasoning,”ArXiv, vol. abs/2403.11401, 2024. 21