Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-16 08:31 UTC · model grok-4.3
The pith
Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that initializing a spatial encoder from the backbone of a feed-forward visual geometry foundation model, pairing it with a standard 2D semantic encoder, and integrating their outputs through a connector produces unified visual tokens that enable superior spatial understanding and reasoning from purely 2D inputs. The claim is conditioned on training with supervised fine-tuning and GRPO on the Spatial-MLLM-120k dataset, with space-aware frame sampling providing a further boost at inference time.
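The training recipe names GRPO alongside supervised fine-tuning. As background, here is a minimal sketch of the group-relative advantage GRPO optimizes, following the DeepSeekMath formulation cited by the paper; scalar per-response rewards and the normalization constant are illustrative assumptions.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each sampled answer to a question is scored
    against the mean and standard deviation of its own group of samples, so
    no separate learned value model is needed."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to one spatial question, reward 1 if correct else 0.
# grpo_advantages([1, 0, 0, 1]) is approximately [1.0, -1.0, -1.0, 1.0].
```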
What carries the argument
Dual-encoder architecture with a semantic 2D visual encoder and a spatial encoder derived from a visual geometry foundation model.
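A minimal sketch of what such a dual-encoder plus connector could look like; the encoder modules, the MLP connector, and all dimensions here are placeholder assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualEncoderConnector(nn.Module):
    """Sketch of the dual-encoder idea: fuse semantic (2D) tokens with
    structure tokens from a geometry-initialized encoder into unified
    visual tokens for the LLM."""

    def __init__(self, semantic_encoder, spatial_encoder, d_sem, d_spa, d_llm):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # e.g. a CLIP-style ViT backbone
        self.spatial_encoder = spatial_encoder    # backbone initialized from a geometry model (e.g. VGGT)
        self.connector = nn.Sequential(           # a simple MLP connector; one plausible choice
            nn.Linear(d_sem + d_spa, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        sem = self.semantic_encoder(frames)       # (B, N, d_sem) semantic patch tokens
        spa = self.spatial_encoder(frames)        # (B, N, d_spa) 3D-structure tokens
        # assumes both encoders emit the same token count N, so features can be
        # concatenated per token before projection
        fused = torch.cat([sem, spa], dim=-1)
        return self.connector(fused)              # (B, N, d_llm) unified visual tokens
```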
If this is right
- Performance reaches state-of-the-art levels across multiple visual spatial understanding and reasoning benchmarks using only 2D data.
- Video-based spatial tasks improve because the model selects frames that carry the most spatial information.
- Training does not require any additional 3D or 2.5D data beyond the 2D inputs and the new 120k dataset.
- The approach works for both image and video inputs without changes to the core model.
Where Pith is reading between the lines
- Models built this way might scale to longer videos or streaming inputs if the sampling strategy generalizes.
- Similar initialization tricks could be tried on other geometry foundation models to boost different spatial capabilities.
- Real-world deployment on mobile devices becomes more feasible, since no extra depth-sensing hardware is needed.
Load-bearing premise
That the spatial encoder initialized from a visual geometry foundation model can reliably extract usable three-dimensional structure features directly from two-dimensional image or video inputs without any three-dimensional supervision during training.
What would settle it
Run the model on a held-out set of 2D images that depict clear 3D spatial relations. If its accuracy on spatial questions merely matches, or falls below, that of a standard CLIP-based MLLM, the added spatial encoder provides no benefit.
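A sketch of that decisive comparison, assuming both systems are exposed as simple question-answering callables; `spatial_mllm`, `clip_baseline_mllm`, and `heldout_2d_spatial_qa` are hypothetical names, not artifacts from the paper.

```python
def qa_accuracy(model, dataset):
    """dataset yields (images, question, gold_answer) triples;
    model(images, question) returns an answer string."""
    hits = sum(
        model(images, question).strip().lower() == gold.strip().lower()
        for images, question, gold in dataset
    )
    return hits / len(dataset)

# If qa_accuracy(spatial_mllm, heldout_2d_spatial_qa) is no higher than
# qa_accuracy(clip_baseline_mllm, heldout_2d_spatial_qa), the added spatial
# encoder contributes nothing measurable on these questions.
```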
Original abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder-initialized from the backbone of the visual geometry model-to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Spatial-MLLM, a dual-encoder MLLM for visual-based spatial reasoning from purely 2D image and video inputs. It uses a pretrained 2D visual encoder for semantic features and a spatial encoder initialized from the backbone of a feed-forward visual geometry foundation model to extract 3D structure features, fuses them via a connector, applies space-aware frame sampling for videos, constructs the Spatial-MLLM-120k dataset, and trains with supervised fine-tuning plus GRPO, claiming state-of-the-art performance across spatial understanding and reasoning benchmarks.
Significance. If the experimental results hold, the work offers a practical route to spatial intelligence in MLLMs without requiring 3D or 2.5D training data, which would expand applicability to standard image and video pipelines in robotics, AR/VR, and video analysis. The dataset release and GRPO training protocol constitute concrete, reusable contributions.
major comments (2)
- Abstract and §4 (Experiments): the central SOTA claim is asserted after training on the 120k dataset with SFT and GRPO, yet the abstract supplies no quantitative metrics, ablation tables, or error bars. The load-bearing performance numbers and comparisons to prior video MLLMs must be presented with concrete scores, baselines, and statistical details to substantiate the headline result.
- §3.1 (Dual-encoder architecture): the claim that the spatial encoder, initialized from the geometry-model backbone, reliably supplies usable 3D structure features from 2D inputs without any 3D supervision is central to the novelty. No ablation, feature visualization, or quantitative probe (e.g., geometric vs. semantic leakage) is described that isolates the contribution of these transferred features; if they collapse to semantic information already available from the CLIP-style encoder, the architecture reduces to a standard video MLLM and the reported gains are unexplained.
minor comments (2)
- §3.2 and Figure 2: the space-aware frame sampling strategy is introduced without pseudocode or an explicit selection criterion; a short algorithmic description would improve reproducibility (a hedged sketch of one possible criterion follows below this list).
- References and §4.1: several recent video MLLM baselines (e.g., Video-LLaVA, LLaVA-NeXT-Video) are cited, but the exact versions and training settings used for comparison should be stated explicitly in §4.1.
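For concreteness, a hedged sketch of one plausible selection criterion: greedy maximization of newly covered spatial cells, in the spirit of the submodular max-coverage results the paper cites [59, 60]. The `coverage_fn` (how a frame is mapped to the set of cells it observes) and the stopping rule are assumptions, not the paper's stated algorithm.

```python
def space_aware_sample(frames, coverage_fn, k):
    """Greedy max-coverage frame selection: repeatedly pick the frame that adds
    the most not-yet-covered spatial cells. coverage_fn(frame) is assumed to
    return the set of spatial cells (e.g. voxels) the frame observes; how that
    set is computed (depth, pose, voxelization) is not specified here."""
    selected, covered = [], set()
    remaining = list(range(len(frames)))
    for _ in range(min(k, len(frames))):
        gains = [(len(coverage_fn(frames[i]) - covered), i) for i in remaining]
        best_gain, best_idx = max(gains)
        if best_gain == 0:          # nothing new left to cover
            break
        selected.append(best_idx)
        covered |= coverage_fn(frames[best_idx])
        remaining.remove(best_idx)
    return sorted(selected)         # keep temporal order of the chosen frames
```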
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of results and the validation of the dual-encoder design. All changes are highlighted in the revised version.
Point-by-point responses
-
Referee: Abstract and §4 (Experiments): the central SOTA claim is asserted after training on the 120k dataset with SFT and GRPO, yet the abstract supplies no quantitative metrics, ablation tables, or error bars. The load-bearing performance numbers and comparisons to prior video MLLMs must be presented with concrete scores, baselines, and statistical details to substantiate the headline result.
Authors: We agree that the abstract should contain the key quantitative results to support the SOTA claim. In the revised manuscript we have added the primary benchmark scores (e.g., average gains over prior video MLLMs on spatial understanding and reasoning tasks) together with the main baselines. In §4 we have augmented the tables with error bars computed over multiple runs and have made the comparisons to prior methods more explicit by listing exact scores and dataset splits. These additions directly address the request for concrete metrics and statistical detail. revision: yes
-
Referee: §3.1 (Dual-encoder architecture): the claim that the spatial encoder, initialized from the geometry-model backbone, reliably supplies usable 3D structure features from 2D inputs without any 3D supervision is central to the novelty. No ablation, feature visualization, or quantitative probe (e.g., geometric vs. semantic leakage) is described that isolates the contribution of these transferred features; if they collapse to semantic information already available from the CLIP-style encoder, the architecture reduces to a standard video MLLM and the reported gains are unexplained.
Authors: We acknowledge that an explicit isolation of the spatial encoder’s contribution strengthens the novelty argument. The original submission already contained an ablation in §4.3 that removed the spatial encoder and reported the resulting performance drop. To further address the referee’s concern we have added (i) t-SNE visualizations of features from both encoders on the same inputs and (ii) a quantitative probe that measures geometric consistency (e.g., depth and pose estimation accuracy) when using only the spatial encoder versus only the semantic encoder. These new analyses demonstrate that the spatial encoder supplies distinct 3D-structure information that is not redundant with the CLIP-style encoder, thereby explaining the observed gains. revision: yes
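As an illustration of the kind of quantitative probe described in this response, here is a minimal ridge-regression readout from frozen encoder features to per-patch depth; the shapes, the closed-form solver, and the use of depth as the geometric target are assumptions for the sketch, not the authors' protocol.

```python
import torch

def linear_depth_probe_error(features, depth_targets, lam=1e-3):
    """Fit a closed-form ridge-regression readout from frozen encoder features
    to per-patch depth and return the probe's mean squared error. Lower error
    suggests the features carry more recoverable geometric information."""
    X = features.reshape(-1, features.shape[-1])             # (num_patches, d)
    y = depth_targets.reshape(-1, 1).to(X.dtype)             # (num_patches, 1)
    d = X.shape[-1]
    # W = (X^T X + lam * I)^{-1} X^T y  (ridge-regression normal equations)
    W = torch.linalg.solve(X.T @ X + lam * torch.eye(d, dtype=X.dtype), X.T @ y)
    return torch.mean((X @ W - y) ** 2)

# Compare, on the same frames and ground-truth depth:
#   linear_depth_probe_error(spatial_encoder_feats, depth)
#   linear_depth_probe_error(semantic_encoder_feats, depth)
```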
Circularity Check
No circularity: empirical architecture with external evaluation
full rationale
The paper contains no equations, derivations, or predictions that reduce to inputs by construction. It describes a dual-encoder architecture (semantic encoder + spatial encoder initialized from a pre-trained visual geometry model), a space-aware frame sampling strategy, and training on the constructed Spatial-MLLM-120k dataset using SFT and GRPO. Performance is measured on external real-world datasets. No self-citation chains, fitted parameters renamed as predictions, or self-definitional steps are present. The central SOTA claim rests on experimental results rather than tautological reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: pretrained visual geometry foundation models contain transferable 3D structural priors usable on 2D inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Why unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Cited passage: "dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder—initialized from the backbone of the visual geometry model—to extract 3D structure features. A connector then integrates both features into unified visual tokens"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
  PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
  ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
- Exploring Spatial Intelligence from a Generative Perspective
  Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
- Why MLLMs Struggle to Determine Object Orientations
  Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.
- Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
  A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
- PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
  PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
- SCP: Spatial Causal Prediction in Video
  SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
  SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
- ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
  ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
- Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
  GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
  EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
- Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
  Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...
- Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
  MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
  VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
- Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
  Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
  OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
  OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
Reference graph
Works this paper leans on
- [1] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., "Flamingo: a visual language model for few-shot learning," NeurIPS, 2022.
- [2] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," in ICML, 2023.
- [3] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," NeurIPS, 2024.
- [4] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024.
- [5] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.
- [6] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li, "LLaVA-OneVision: Easy visual task transfer," arXiv preprint arXiv:2408.03326, 2024.
- [7] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al., "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks," in CVPR, 2024.
- [8] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, "Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond," arXiv preprint arXiv:2308.12966, 2023.
- [9] B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan, "Video-LLaVA: Learning united visual representation by alignment before projection," arXiv preprint arXiv:2311.10122, 2023.
- [10] Z. Cheng, S. Leng, H. Zhang, Y. Xin, X. Li, G. Chen, Y. Zhu, W. Zhang, Z. Luo, D. Zhao, et al., "VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs," arXiv preprint arXiv:2406.07476, 2024.
- [11] R. Qian, X. Dong, P. Zhang, Y. Zang, S. Ding, D. Lin, and J. Wang, "Streaming long video understanding with large language models," arXiv preprint arXiv:2405.16009, 2024.
- [12] Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, "LLaVA-Video: Video instruction tuning with synthetic data," arXiv preprint arXiv:2410.02713, 2024.
- [13] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K.-Y. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin, "Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution," arXiv preprint arXiv:2409.12191, 2024.
- [14] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, "Qwen2.5-VL technical report," arXiv preprint arXiv:2502.13923, 2025.
- [15] R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong, J.-B. Huang, J. Liu, Y. Ren, Z. Zhao, and S. Watanabe, "AudioGPT: Understanding and generating speech, music, sound, and talking head," arXiv preprint arXiv:2304.12995, 2023.
- [16] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "SALMONN: Towards generic hearing abilities for large language models," arXiv preprint arXiv:2310.13289, 2023.
- [17] Z. Liu, Y. Dong, J. Wang, Z. Liu, W. Hu, J. Lu, and Y. Rao, "Ola: Pushing the frontiers of omni-modal language model with progressive modality alignment," arXiv preprint arXiv:2502.04328, 2025.
- [18] J. Yang, S. Yang, A. W. Gupta, R. Han, F.-F. Li, and S. Xie, "Thinking in space: How multimodal large language models see, remember, and recall spaces," arXiv preprint arXiv:2412.14171, 2024.
- [19] B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Driess, P. Florence, D. Sadigh, L. J. Guibas, and F. Xia, "SpatialVLM: Endowing vision-language models with spatial reasoning capabilities," in CVPR, pp. 14455–14465, 2024.
- [21] J. Deng, T. He, L. Jiang, T. Wang, F. Dayoub, and I. Reid, "3D-LLaVA: Towards generalist 3D LMMs with omni superpoint transformer," arXiv preprint arXiv:2501.01163, 2025.
- [22] H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y. Zhao, T. Jin, and Z. Zhao, "Chat-Scene: Bridging 3D scene and large language models with object identifiers," in NeurIPS, 2023.
- [23] S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen, "LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning and planning," in CVPR, pp. 26428–26438, 2024.
- [24] C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, "LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness," arXiv preprint arXiv:2409.18125, 2024.
- [25] D. Zheng, S. Huang, and L. Wang, "Video-3D LLM: Learning position-aware video representation for 3D scene understanding," arXiv preprint arXiv:2412.00493, 2024.
- [26] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. G. Dimakis, J. Jitsev, ..., "DataComp: In search of the next generation of multimodal datasets."
- [27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in ICML, 2021.
- [28] Z. Tang, L. Lian, S. Eisape, X. Wang, R. Herzig, A. Yala, A. Suhr, T. Darrell, and D. M. Chan, "TULIP: Towards unified language-image pretraining," arXiv preprint arXiv:2503.15485, 2025.
- [29] J. Qi, J. Liu, H. Tang, and Z. Zhu, "Beyond semantics: Rediscovering spatial awareness in vision-language models," arXiv preprint arXiv:2503.17349, 2025.
- [30] B. Zhang, P. Zhang, X. Dong, Y. Zang, and J. Wang, "Long-CLIP: Unlocking the long-text capability of CLIP," in ECCV, 2024.
- [31] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, "DUSt3R: Geometric 3D vision made easy," in CVPR, pp. 20697–20709, 2024.
- [32] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, "VGGT: Visual geometry grounded transformer," in CVPR, 2025.
- [33] Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely, "MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos," arXiv preprint arXiv:2412.04463, 2024.
- [34] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J.-M. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B.-L. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D.-L. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. ..., "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," arXiv preprint, 2025.
- [35] Z. Shao, P. Wang, Q. Zhu, R. Xu, J.-M. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.
- [36] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, F. Xia, Q. Le, and D. Zhou, "Chain-of-thought prompting elicits reasoning in large language models," arXiv preprint arXiv:2201.11903, 2022.
- [37] D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, "ScanQA: 3D question answering for spatial scene understanding," in CVPR, pp. 19107–19117, 2022.
- [38] X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S.-C. Zhu, and S. Huang, "SQA3D: Situated question answering in 3D scenes," arXiv preprint arXiv:2210.07474, 2022.
- [39] H. Liu, C. Li, Y. Li, and Y. J. Lee, "Improved baselines with visual instruction tuning," in CVPR, pp. 26296–26306, 2024.
- [40] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, "MiniGPT-4: Enhancing vision-language understanding with advanced large language models," arXiv preprint arXiv:2304.10592, 2023.
- [41] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, "PandaGPT: One model to instruction-follow them all," arXiv preprint arXiv:2305.16355, 2023.
- [42] R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. Kong, et al., "DetGPT: Detect what you need via reasoning," arXiv preprint arXiv:2305.14167, 2023.
- [43] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, and Y. Qiao, "VideoChat: Chat-centric video understanding," arXiv preprint arXiv:2305.06355, 2023.
- [44] Y. Chen, S. Yang, H. Huang, T. Wang, R. Xu, R. Lyu, D. Lin, and J. Pang, "Grounded 3D-LLM with referent tokens," arXiv preprint arXiv:2405.10370, 2024.
- [45] Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao, "Chat-3D: Data-efficiently tuning large language model for universal dialogue of 3D scenes," arXiv preprint arXiv:2308.08769, 2023.
- [47] H. Huang, Y. Chen, Z. Wang, R. Huang, R. Xu, T. Wang, L. Liu, X. Cheng, Y. Zhao, J. Pang, et al., "Chat-Scene: Bridging 3D scene and large language models with object identifiers," in NeurIPS, 2024.
- [48] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, "3D-LLM: Injecting the 3D world into large language models," Advances in Neural Information Processing Systems, vol. 36, pp. 20482–20494, 2023.
- [50] Z. Qi, Z. Zhang, Y. Fang, J. Wang, and H. Zhao, "GPT4Scene: Understand 3D scenes from videos with vision-language models," arXiv preprint arXiv:2501.01428, 2025.
- [51] Z. Liu, Y. Dong, Z. Liu, W. Hu, J. Lu, and Y. Rao, "Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution," arXiv preprint arXiv:2409.12961, 2024.
- [52] X. Wang, Y. Zhang, O. Zohar, and S. Yeung-Levy, "VideoAgent: Long-form video understanding with large language model as agent," in ECCV, pp. 58–76, Springer, 2024.
- [53] Y. Li, Y. Zhang, T. Lin, X. Liu, W. Cai, Z. Liu, and B. Zhao, "STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding?," arXiv preprint arXiv:2503.23765, 2025.
- [54] P. Wu, Y. Liu, M. Liu, and J. Shen, "ST-Think: How multimodal large language models reason about 4D worlds from ego-centric videos," arXiv preprint arXiv:2503.12542, 2025.
- [55] S. Zhou, A. Vilesov, X. He, Z. Wan, S. Zhang, A. N. D. C. D. Chen, and X. E. W. A. Kadambi, "VLM4D: Towards spatiotemporal awareness in vision language models."
- [56] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
- [57] T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, "Vision transformers need registers," arXiv preprint arXiv:2309.16588, 2023.
- [58] X. Pan, S. N. Shukla, A. Singh, Z. Zhao, S. K. Mishra, J. Wang, Z. Xu, J. Chen, K. Li, F. Juefei-Xu, J. Hou, and S. Xie, "Transfer between modalities with metaqueries," 2025.
- [59] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions—I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.
- [60] D. S. Hochbaum, "Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems," pp. 94–143, PWS Publishing Co., USA, 1996.
- [61] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in CVPR, pp. 2432–2443, 2017.
- [62] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
- [63] F. Xue, Y. Chen, D. Li, Q. Hu, L. Zhu, X. Li, Y. Fang, H. Tang, S. Yang, Z. Liu, E. He, H. Yin, P. Molchanov, J. Kautz, L. Fan, Y. Zhu, Y. Lu, and S. Han, "LongVILA: Scaling long-context visual language models for long videos," arXiv preprint arXiv:2408.10188, 2024.
- [64] J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han, "VILA: On pre-training for visual language models," in CVPR, pp. 26679–26689, 2024.
- [65] P. Zhang, K. Zhang, B. Li, G. Zeng, J. Yang, Y. Zhang, Z. Wang, H. Tan, C. Li, and Z. Liu, "Long context transfer from language to vision," arXiv preprint arXiv:2406.16852, 2024.
- [66] C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, "ScanNet++: A high-fidelity dataset of 3D indoor scenes," in ICCV, pp. 12–22, 2023.
- [67] G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, "ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data," in NeurIPS Datasets and Benchmarks Track (Round 1), 2021.
- [68] J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang, "An embodied generalist agent in 3D world," arXiv preprint arXiv:2311.12871, 2023.
- [69] Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li, "3D-VisTA: Pre-trained transformer for 3D vision and text alignment," in ICCV, pp. 2899–2909, 2023.
- [70] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, "3D-LLM: Injecting the 3D world into large language models," NeurIPS, 2023.
- [71] Q.-Y. Zhou, J. Park, and V. Koltun, "Open3D: A modern library for 3D data processing," arXiv preprint arXiv:1801.09847, 2018.
- [72] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
- [73] S. Gupta, P. Arbeláez, and J. Malik, "Perceptual organization and recognition of indoor scenes from RGB-D images," in CVPR, pp. 564–571, 2013.
- [74] R. Fu, J. Liu, X. Chen, Y. Nie, and W. Xiong, "Scene-LLM: Extending language model for 3D visual understanding and reasoning," arXiv preprint arXiv:2403.11401, 2024.