Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
Pith reviewed 2026-05-16 06:36 UTC · model grok-4.3
The pith
GeoThinker improves spatial reasoning by letting multimodal models actively select relevant geometric evidence based on their reasoning needs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that shifting from passive global fusion of 3D geometry to active, reasoning-conditioned integration allows the model to selectively query and incorporate task-relevant geometric evidence. This is implemented via Spatial-Grounded Fusion at chosen VLM layers with frame-strict cross-attention and Importance Gating, resulting in a peak score of 72.6 on VSI-Bench and enhanced performance in embodied referring and autonomous driving.
What carries the argument
The Spatial-Grounded Fusion process, which applies frame-strict cross-attention conditioned on semantic visual priors and uses Importance Gating to prioritize task-relevant structures.
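To make the mechanism concrete, here is a minimal sketch of what frame-strict cross-attention with an importance bias might look like. It is not the paper's implementation. The function name `frame_strict_fusion`, the importance vector `w_imp`, the additive form of the bias, the residual update, and the omission of learned query/key/value projections are all assumptions made for illustration. The sketch also reads "frame-strict" as "visual tokens of frame t attend only to geometry tokens of frame t", which this page does not spell out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def frame_strict_fusion(visual, geometry, w_imp):
    """Hypothetical sketch, not the authors' code.

    visual[t][i]   -- d-dim semantic visual token i of frame t (acts as the query)
    geometry[t][j] -- d-dim geometry token j of frame t (acts as key and value)
    w_imp          -- assumed d-dim vector scoring the importance of a geometry token

    Returns (fused, weights): fused tokens and per-query attention weights.
    """
    fused, weights = [], []
    # Frame-strict: the loop pairs frame t only with its own geometry tokens.
    for vis_t, geo_t in zip(visual, geometry):
        # Assumed gating form: one additive logit bias per geometry token.
        bias = [dot(w_imp, k) for k in geo_t]
        fused_t, weights_t = [], []
        for q in vis_t:
            scale = math.sqrt(len(q))
            logits = [dot(q, k) / scale + b for k, b in zip(geo_t, bias)]
            attn = softmax(logits)
            # Attention-weighted sum of the same frame's geometry tokens.
            retrieved = [
                sum(a * k[c] for a, k in zip(attn, geo_t)) for c in range(len(q))
            ]
            # Residual update keeps the original semantic token.
            fused_t.append([qc + rc for qc, rc in zip(q, retrieved)])
            weights_t.append(attn)
        fused.append(fused_t)
        weights.append(weights_t)
    return fused, weights
```

Under these assumptions, changing the geometry tokens of one frame leaves the fused tokens of every other frame untouched, and a larger importance score for a geometry token raises the attention it receives within its own frame.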
If this is right
- GeoThinker reports a new state-of-the-art peak score of 72.6 on VSI-Bench.
- It exhibits robust generalization to complex downstream scenarios including embodied referring and autonomous driving.
- Active integration reduces semantic-geometry misalignment and redundant signals compared to passive methods.
- The results support that active integration of spatial structures is essential for next-generation spatial intelligence.
Where Pith is reading between the lines
- Similar active selection techniques could be adapted for integrating other data modalities like audio or text priors in multimodal models.
- By focusing on relevant geometry only, the approach may lower computational demands in large-scale deployments.
- Extending this to models without 3D encoders or testing on additional spatial benchmarks would further validate the method.
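The computational point in the list above can be illustrated with a rough count of query-key pairs. This is a back-of-the-envelope sketch, not a measurement from the paper. It assumes that frame-strict attention restricts each frame's visual tokens to the same frame's geometry tokens, and that the passive alternative attends across all frames. The function name and the token counts are hypothetical.

```python
def attention_pairs(num_frames, vis_per_frame, geo_per_frame, frame_strict):
    """Count query-key pairs scored by one cross-attention layer.

    Hypothetical helper: it counts attention scores only, and ignores the
    cost of the 3D encoder, projections, and the rest of the model.
    """
    if frame_strict:
        # Block-diagonal pattern: each frame attends only within itself.
        return num_frames * vis_per_frame * geo_per_frame
    # Global pattern: every visual token scores every geometry token.
    return (num_frames * vis_per_frame) * (num_frames * geo_per_frame)


strict = attention_pairs(8, 4, 4, frame_strict=True)
global_ = attention_pairs(8, 4, 4, frame_strict=False)
print(strict, global_, global_ // strict)  # prints: 128 1024 8
```

Under this reading, the number of scored pairs shrinks by a factor equal to the number of frames. Whether that translates into lower end-to-end cost depends on parts of the pipeline that this count leaves out.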
Load-bearing premise
Frame-strict cross-attention combined with importance gating can reliably select task-relevant geometry and reduce misalignment without introducing new selection biases or requiring task-specific tuning.
What would settle it
If an experiment shows that a simple passive fusion method achieves similar or better scores than GeoThinker on VSI-Bench and the downstream tasks, the benefit of the active components would be called into question.
Original abstract
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GeoThinker, a framework that shifts from passive global fusion of geometric priors in MLLMs to active selective retrieval for spatial reasoning. It applies Spatial-Grounded Fusion at selected VLM layers using frame-strict cross-attention conditioned on semantic priors, calibrated by importance gating to bias toward task-relevant structures, and reports a new SOTA peak score of 72.6 on VSI-Bench along with improved generalization on embodied referring and autonomous driving tasks.
Significance. If the performance gains are shown to be causally attributable to the active integration mechanisms rather than confounding factors, the work would offer a concrete architectural advance in reducing semantic-geometry misalignment, with potential impact on downstream spatial tasks in vision-language models.
major comments (2)
- [§4 (Experimental Evaluation)] The manuscript asserts SOTA performance of 72.6 on VSI-Bench and robust generalization but supplies no quantitative baseline comparisons, component ablations, error analysis, or dataset statistics; this leaves the central claim that frame-strict cross-attention plus importance gating drives the improvement without direct supporting evidence.
- [§3.2 (Spatial-Grounded Fusion)] No controlled ablation is reported that removes or relaxes the frame-strict constraint and importance gating while holding layer selection, parameter count, and training fixed; without these isolations the attribution of the 72.6 score to the selective-retrieval design remains unverified.
minor comments (1)
- [Abstract and §3] The abstract and method descriptions use terms such as 'carefully selected VLM layers' without specifying the selection criterion or providing a diagram of the layer placement; adding this detail would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below and will strengthen the experimental section with additional analyses in the revised manuscript.
- Referee: [§4 (Experimental Evaluation)] The manuscript asserts SOTA performance of 72.6 on VSI-Bench and robust generalization but supplies no quantitative baseline comparisons, component ablations, error analysis, or dataset statistics; this leaves the central claim that frame-strict cross-attention plus importance gating drives the improvement without direct supporting evidence.
Authors: We agree that the current presentation of results would benefit from more explicit quantitative support. While the manuscript reports the 72.6 peak score on VSI-Bench together with generalization to embodied referring and autonomous driving tasks, we acknowledge the absence of detailed baseline tables, component-wise ablations, error breakdowns, and dataset statistics. In the revision we will add these elements, including direct comparisons against passive global-fusion baselines and quantitative isolation of performance gains attributable to the proposed mechanisms. revision: yes
- Referee: [§3.2 (Spatial-Grounded Fusion)] No controlled ablation is reported that removes or relaxes the frame-strict constraint and importance gating while holding layer selection, parameter count, and training fixed; without these isolations the attribution of the 72.6 score to the selective-retrieval design remains unverified.
Authors: We will add the requested controlled ablations in the revised manuscript. These experiments will remove or relax the frame-strict constraint and the importance-gating module individually while keeping layer selection, total parameter count, and training protocol fixed, thereby providing direct evidence for the contribution of each design choice to the reported performance. revision: yes
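The controlled ablation discussed in this exchange can be laid out as a small configuration grid. The sketch below is illustrative only: the field names, the `FIXED` placeholder values, and the variant labels are hypothetical and do not come from the paper or its repository. It enumerates a full 2x2 grid, which covers the two individual removals the authors promise plus the variant with both components removed.

```python
from itertools import product

# Settings held constant across all variants (placeholder values, assumed).
FIXED = {
    "fusion_layers": "same-as-full-model",
    "param_budget": "matched",
    "training_protocol": "same-as-full-model",
}


def ablation_grid():
    """Return one config per combination of the two components under test."""
    variants = []
    for frame_strict, gating in product([True, False], repeat=2):
        cfg = dict(FIXED)  # copy so every variant shares the fixed settings
        cfg["frame_strict"] = frame_strict
        cfg["importance_gating"] = gating
        if frame_strict and gating:
            cfg["name"] = "full"
        elif frame_strict:
            cfg["name"] = "no-gating"
        elif gating:
            cfg["name"] = "no-frame-strict"
        else:
            cfg["name"] = "neither"
        variants.append(cfg)
    return variants


for cfg in ablation_grid():
    print(cfg["name"], cfg["frame_strict"], cfg["importance_gating"])
```

Comparing "full" against "no-gating" and "no-frame-strict" isolates each component, while "neither" serves as a parameter-matched stand-in for the passive baseline.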
Circularity Check
No circularity: architectural framework independent of fitted or self-referential quantities
The paper describes GeoThinker as an architectural shift from passive to active geometry integration via Spatial-Grounded Fusion at selected layers, frame-strict cross-attention, and importance gating. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs or to self-citations. Performance on VSI-Bench is reported as empirical outcome of the proposed modules rather than a renamed fit or self-referential prediction. The central claim rests on the design of selective retrieval mechanisms, which are presented as independent architectural choices without load-bearing self-citation chains or uniqueness theorems imported from prior author work. This matches the default expectation of a non-circular model paper.
Forward citations
Cited by 3 Pith papers
- Cambrian-P: Pose-Grounded Video Understanding
  Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
- GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
  GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.