Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, JiaWang Bian, Xiaodan Liang

arXiv: 2602.06037 · v4 · submitted 2026-02-05 · cs.CV

Pith reviewed 2026-05-16 06:36 UTC · model grok-4.3

classification cs.CV

keywords spatial reasoning · multimodal large language models · active geometry integration · cross-attention · importance gating · VSI-Bench · embodied AI · spatial intelligence

The pith

GeoThinker improves spatial reasoning by letting multimodal models actively select relevant geometric evidence based on their reasoning needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that passive fusion of geometry into multimodal large language models causes semantic misalignment and hurts spatial reasoning performance. GeoThinker instead retrieves geometric features selectively at specific layers, guided by semantic priors through cross-attention and importance gating. The authors report that this active method yields state-of-the-art results on spatial tasks, including a peak score of 72.6 on VSI-Bench. A reader would care because better spatial intelligence enables more capable AI for navigation, manipulation, and scene understanding in 3D worlds.

Core claim

The central claim is that shifting from passive global fusion of 3D geometry to active, reasoning-conditioned integration allows the model to selectively query and incorporate task-relevant geometric evidence. This is implemented via Spatial-Grounded Fusion at chosen VLM layers with frame-strict cross-attention and Importance Gating, resulting in a peak score of 72.6 on VSI-Bench and enhanced performance in embodied referring and autonomous driving.

What carries the argument

The Spatial-Grounded Fusion process, which applies frame-strict cross-attention conditioned on semantic visual priors and uses Importance Gating to prioritize task-relevant structures.
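
To make the mechanism concrete, here is a minimal sketch of frame-strict cross-attention with an additive importance bias, written in plain Python. It is an illustration under assumptions, not the authors' implementation: the function name, the argument layout, and the choice to model Importance Gating as one scalar logit per geometry token are all hypothetical, and how those logits would be predicted is not modeled here.

```python
import math


def frame_strict_cross_attention(queries, keys, values, q_frames, k_frames, gate_logits):
    """Attend from semantic tokens to geometry tokens of the same frame only.

    queries:      list of d-dim vectors (semantic visual tokens)
    keys, values: lists of vectors (geometry tokens)
    q_frames:     frame index of each query token
    k_frames:     frame index of each geometry token
    gate_logits:  one scalar per geometry token, added to the attention
                  logits (hypothetical stand-in for Importance Gating)
    """
    d = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q, q_frame in zip(queries, q_frames):
        # Frame-strict constraint: only geometry tokens from the query's frame.
        idx = [j for j, k_frame in enumerate(k_frames) if k_frame == q_frame]
        if not idx:
            # No geometry available for this frame: contribute nothing.
            outputs.append([0.0] * out_dim)
            continue
        # Scaled dot-product logits plus the per-token importance bias.
        logits = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) + gate_logits[j]
            for j in idx
        ]
        # Numerically stable softmax over the allowed tokens.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * values[j][c] for w, j in zip(weights, idx)) for c in range(out_dim)]
        )
    return outputs
```

Restricting the softmax to same-frame tokens is equivalent to adding a mask of negative infinity to cross-frame logits. Raising a token's gate logit shifts attention mass toward it without changing which tokens are eligible, which is one plausible reading of a "localized attention bias".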

If this is right

GeoThinker achieves a new state-of-the-art score of 72.6 on the VSI-Bench.
It exhibits robust generalization to complex downstream scenarios including embodied referring and autonomous driving.
Active integration reduces semantic-geometry misalignment and redundant signals compared to passive methods.
The results support that active integration of spatial structures is essential for next-generation spatial intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar active selection techniques could be adapted for integrating other data modalities like audio or text priors in multimodal models.
By focusing on relevant geometry only, the approach may lower computational demands in large-scale deployments.
Extending this to models without 3D encoders or testing on additional spatial benchmarks would further validate the method.

Load-bearing premise

Frame-strict cross-attention combined with importance gating can reliably select task-relevant geometry and reduce misalignment without introducing new selection biases or requiring task-specific tuning.

What would settle it

If an experiment shows that a simple passive fusion method achieves similar or better scores than GeoThinker on VSI-Bench and the downstream tasks, the benefit of the active components would be called into question.

Figures

Figures reproduced from arXiv: 2602.06037 by Hang Xu, Haoyuan Li, Jianhua Han, JiaWang Bian, Kun Xiang, Qihang Cao, Tao Tang, Xiaodan Liang, Zihan Guo.

**Figure 1.** Thinking with geometry through active integration. Left: (a) Passive Fusion: Conventional MLLMs indiscriminately incorporate a global stream of geometric features, which leads to significant information redundancy and semantic-texture misalignment. (b) Active Perception (GeoThinker): Our framework shifts the paradigm by empowering the model to discern and selectively retrieve spatial cues guided by its int…

**Figure 2.** Comparison of geometry integration paradigms. (a) and (b) represent passive paradigms that indiscriminately incorporate geometric streams, often leading to semantic-geometry misalignment and redundant noise. In contrast, (c) GeoThinker shifts to active perception, empowering the MLLM to autonomously discern and selectively retrieve task-related geometric cues guided by internal reasoning.

**Figure 3.** Overview of the GeoThinker architecture. Our framework features a decoupled interaction mechanism where the VGGT is integrated via Spatial-Grounded Fusion layers. By employing Importance Gating, the model predicts a localized attention bias to dynamically modulate the injection of geometric textures. This design ensures that rich structural details are only queried when they are contextually relevant to th…

**Figure 4.** Visualization of Importance Gating Scores. Heatmaps illustrate that GeoThinker naturally learns to prioritize salient object boundaries and structural edges while suppressing non-informative regions like floors or walls.

**Figure 5.** Computational cost comparison of FLOPs and inference latency. … series, the FLOPs difference between our 8-frame model and VG-LLM is negligible, with the SGF module accounting for less than 5% of the total FLOPs. While this proportion slightly increases on the Qwen3-VL series due to differences in hidden state dimensions, the overall efficiency remains high. Efficiency of Spatial Compression: Our 32-frame se…

**Figure 6.** Visualization of importance score on MindCube.

**Figure 7.** Visualization of importance score on VSI-Bench. … towel and trash bin in the bathroom scene, and the backpack and computer mouse in the office setting. Spatial Reasoning via Landmark Identification. The visualization demonstrates that the model's spatial reasoning is grounded in precise object localization. In the office example, where the backpack is partially obscured or located among numerous similar de…

**Figure 8.** Visualization of robustness to image resolution. The left panels show the importance score heatmaps, while the right panels provide a masked visualization where only regions with a heatmap value greater than 0.5 are preserved. The experiment evaluates model performance across varying input quality, from original resolution down to 6.25%. L.3. LLM usage: We thank the Gemini 2.5-Flash for assistance in editin…
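
The masked visualization described for Figure 8 (keep only regions whose heatmap value exceeds 0.5) can be sketched as below. This is a hedged illustration, not the authors' plotting code: the function name is hypothetical, and it assumes a single-channel image stored as nested lists with a heatmap already resized to the same shape and scaled to the range 0 to 1.

```python
def mask_by_importance(image, heatmap, threshold=0.5):
    """Keep pixels whose importance score exceeds the threshold; zero the rest.

    image and heatmap are equally shaped nested lists (rows of values).
    The 0.5 default mirrors the cutoff quoted in the Figure 8 caption.
    """
    if len(image) != len(heatmap) or any(
        len(img_row) != len(heat_row) for img_row, heat_row in zip(image, heatmap)
    ):
        raise ValueError("image and heatmap must have the same shape")
    return [
        [px if score > threshold else 0 for px, score in zip(img_row, heat_row)]
        for img_row, heat_row in zip(image, heatmap)
    ]
```

The comparison is strict, so a pixel whose score equals the threshold exactly is suppressed, matching the caption's "greater than 0.5" wording.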

Original abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The active retrieval idea for geometry in MLLMs is worth testing, but the paper still needs to prove its specific components are responsible for the gains.

The paper's main contribution is swapping passive global fusion of 3D geometry features for an active process where the model selectively retrieves relevant geometry based on its internal reasoning state. This targets a real problem: indiscriminate mixing often creates semantic misalignment and extra noise in spatial tasks. The proposed pieces are Spatial-Grounded Fusion at chosen layers, frame-strict cross-attention that limits queries to matching frames, and an importance gate that biases toward task-relevant structures. These form a concrete architectural shift from the usual blanket stream approach described in prior work. If the mechanism works, it could help models pull only what they need instead of processing everything at once, which lines up with needs in embodied AI and driving scenarios.

The abstract reports a 72.6 peak on VSI-Bench plus better results on referring and autonomous driving tasks, and it notes code release, which is useful for anyone wanting to inspect the implementation. The soft spot is the lack of controls. The stress-test concern holds: there is no sign of ablations that remove the frame-strict constraint or the gating module while keeping layer choices and other factors fixed. Without those, it is difficult to attribute the headline score to the active retrieval rather than training details or added capacity. The abstract also omits baseline tables, dataset sizes, and error breakdowns, so the generalization claims sit without direct comparison.

This is aimed at people already running MLLMs on spatial benchmarks who might want to try conditional fusion. The thinking is straightforward and engages the right literature on fusion strategies, even if the evidence is still thin. I would send it for peer review so referees can check whether the experiments actually isolate the contribution.

Referee Report

2 major / 1 minor

Summary. The paper proposes GeoThinker, a framework that shifts from passive global fusion of geometric priors in MLLMs to active selective retrieval for spatial reasoning. It applies Spatial-Grounded Fusion at selected VLM layers using frame-strict cross-attention conditioned on semantic priors, calibrated by importance gating to bias toward task-relevant structures, and reports a new SOTA peak score of 72.6 on VSI-Bench along with improved generalization on embodied referring and autonomous driving tasks.

Significance. If the performance gains are shown to be causally attributable to the active integration mechanisms rather than confounding factors, the work would offer a concrete architectural advance in reducing semantic-geometry misalignment, with potential impact on downstream spatial tasks in vision-language models.

major comments (2)

[§4 (Experimental Evaluation)] The manuscript asserts SOTA performance of 72.6 on VSI-Bench and robust generalization but supplies no quantitative baseline comparisons, component ablations, error analysis, or dataset statistics; this leaves the central claim that frame-strict cross-attention plus importance gating drives the improvement without direct supporting evidence.
[§3.2 (Spatial-Grounded Fusion)] No controlled ablation is reported that removes or relaxes the frame-strict constraint and importance gating while holding layer selection, parameter count, and training fixed; without these isolations the attribution of the 72.6 score to the selective-retrieval design remains unverified.
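
The controlled ablation asked for in the second major comment amounts to a 2x2 grid over the two active components with everything else held fixed. A minimal sketch of how such a grid might be enumerated follows; the function name and the configuration keys `frame_strict` and `importance_gating` are hypothetical and do not come from the paper or its code release.

```python
from itertools import product


def ablation_grid(base_config):
    """Enumerate the 2x2 grid over the two active components.

    Every run copies base_config unchanged (layer selection, training
    settings, seed, ...) and only toggles the two hypothetical flags.
    """
    runs = []
    for frame_strict, gating in product([True, False], repeat=2):
        config = dict(base_config)  # shallow copy; the base stays untouched
        config["frame_strict"] = frame_strict
        config["importance_gating"] = gating
        runs.append(config)
    return runs
```

Toggling a module off usually changes the parameter count, so holding capacity fixed as the referee requests would need a compensating change (for example, a same-sized module with its selective behavior disabled). This sketch does not address that.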

minor comments (1)

[Abstract and §3] The abstract and method descriptions use terms such as 'carefully selected VLM layers' without specifying the selection criterion or providing a diagram of the layer placement; adding this detail would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below and will strengthen the experimental section with additional analyses in the revised manuscript.

Referee: [§4 (Experimental Evaluation)] The manuscript asserts SOTA performance of 72.6 on VSI-Bench and robust generalization but supplies no quantitative baseline comparisons, component ablations, error analysis, or dataset statistics; this leaves the central claim that frame-strict cross-attention plus importance gating drives the improvement without direct supporting evidence.

Authors: We agree that the current presentation of results would benefit from more explicit quantitative support. While the manuscript reports the 72.6 peak score on VSI-Bench together with generalization to embodied referring and autonomous driving tasks, we acknowledge the absence of detailed baseline tables, component-wise ablations, error breakdowns, and dataset statistics. In the revision we will add these elements, including direct comparisons against passive global-fusion baselines and quantitative isolation of performance gains attributable to the proposed mechanisms.

revision: yes

Referee: [§3.2 (Spatial-Grounded Fusion)] No controlled ablation is reported that removes or relaxes the frame-strict constraint and importance gating while holding layer selection, parameter count, and training fixed; without these isolations the attribution of the 72.6 score to the selective-retrieval design remains unverified.

Authors: We will add the requested controlled ablations in the revised manuscript. These experiments will remove or relax the frame-strict constraint and the importance-gating module individually while keeping layer selection, total parameter count, and training protocol fixed, thereby providing direct evidence for the contribution of each design choice to the reported performance.

revision: yes

Circularity Check

0 steps flagged

No circularity: architectural framework independent of fitted or self-referential quantities

The paper describes GeoThinker as an architectural shift from passive to active geometry integration via Spatial-Grounded Fusion at selected layers, frame-strict cross-attention, and importance gating. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs or to self-citations. Performance on VSI-Bench is reported as empirical outcome of the proposed modules rather than a renamed fit or self-referential prediction. The central claim rests on the design of selective retrieval mechanisms, which are presented as independent architectural choices without load-bearing self-citation chains or uniqueness theorems imported from prior author work. This matches the default expectation of a non-circular model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach assumes standard VLM layer access and attention primitives from prior work.

pith-pipeline@v0.9.0 · 5548 in / 941 out tokens · 20484 ms · 2026-05-16T06:36:55.247817+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cambrian-P: Pose-Grounded Video Understanding
cs.CV · 2026-05 · unverdicted · novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
cs.CV · 2026-05 · unverdicted · novelty 6.0

GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
cs.CV · 2026-04 · unverdicted · novelty 5.0

SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
