
arxiv: 2602.06037 · v4 · submitted 2026-02-05 · 💻 cs.CV

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Pith reviewed 2026-05-16 06:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial reasoning · multimodal large language models · active geometry integration · cross-attention · importance gating · VSI-Bench · embodied AI · spatial intelligence

The pith

GeoThinker improves spatial reasoning by letting multimodal models actively select relevant geometric evidence based on their reasoning needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that passive fusion of geometry into multimodal large language models causes semantic-geometry misalignment and hurts spatial reasoning performance. GeoThinker instead uses selective retrieval of geometric features at specific layers, guided by semantic priors through cross-attention and importance gating. The paper reports that this active method reaches a peak score of 72.6 on VSI-Bench and improves results on embodied referring and autonomous driving. A reader would care because better spatial intelligence enables more capable AI for navigation, manipulation, and scene understanding in 3D worlds.

Core claim

The central claim is that shifting from passive global fusion of 3D geometry to active, reasoning-conditioned integration allows the model to selectively query and incorporate task-relevant geometric evidence. This is implemented via Spatial-Grounded Fusion at chosen VLM layers with frame-strict cross-attention and Importance Gating, resulting in a peak score of 72.6 on VSI-Bench and enhanced performance in embodied referring and autonomous driving.

What carries the argument

The Spatial-Grounded Fusion process, which applies frame-strict cross-attention conditioned on semantic visual priors and uses Importance Gating to prioritize task-relevant structures.
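To make the mechanism concrete, here is a minimal pure-Python sketch of one way frame-strict cross-attention with an importance bias could work. It is not the authors' implementation. It assumes that "frame-strict" means a semantic query may only attend to geometry tokens from its own frame, and that Importance Gating enters as an additive bias on the attention logits (the Figure 3 caption mentions a predicted "attention bias"). The function name, the toy tensors, and the bias values are all hypothetical.

```python
import math

def softmax(scores):
    """Stable softmax; masked (-inf) entries receive exactly zero weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_strict_gated_attention(queries, q_frames, keys, values, k_frames, gate_bias):
    """Hypothetical sketch: each semantic query attends only to geometry
    tokens from its own frame, with an additive per-token importance bias.

    Assumes every query has at least one geometry token in the same frame.
    """
    scale = math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for q, q_frame in zip(queries, q_frames):
        scores = []
        for k, k_frame, bias in zip(keys, k_frames, gate_bias):
            if k_frame != q_frame:
                scores.append(float("-inf"))  # frame-strict mask
            else:
                dot = sum(qi * ki for qi, ki in zip(q, k))
                scores.append(dot / scale + bias)  # gating shifts the logit
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs, all_weights

# Toy input: one semantic query from frame 0, three geometry tokens.
queries, q_frames = [[1.0, 0.0]], [0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
k_frames = [0, 0, 1]            # the third token belongs to another frame
gate_bias = [0.0, -2.0, 3.0]    # made-up importance logits

outputs, weights = frame_strict_gated_attention(
    queries, q_frames, keys, values, k_frames, gate_bias
)
print(weights[0])
print(outputs[0])
```

In this toy case the third geometry token carries the largest bias but receives zero weight because it sits in a different frame, while the bias shifts weight between the two same-frame tokens. Under these assumptions the mask decides what may be retrieved and the gate decides how much.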

If this is right

  • GeoThinker achieves a new state-of-the-art score of 72.6 on VSI-Bench.
  • It exhibits robust generalization to complex downstream scenarios including embodied referring and autonomous driving.
  • Active integration reduces semantic-geometry misalignment and redundant signals compared to passive methods.
  • The results support the view that active integration of spatial structures is essential for next-generation spatial intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar active selection techniques could be adapted for integrating other data modalities like audio or text priors in multimodal models.
  • By focusing on relevant geometry only, the approach may lower computational demands in large-scale deployments.
  • Extending this to models without 3D encoders, or testing on additional spatial benchmarks, would further test the method.

Load-bearing premise

Frame-strict cross-attention combined with importance gating can reliably select task-relevant geometry and reduce misalignment without introducing new selection biases or requiring task-specific tuning.

What would settle it

If an experiment shows that a simple passive fusion method achieves similar or better scores than GeoThinker on VSI-Bench and the downstream tasks, the benefit of the active components would be called into question.

Figures

Figures reproduced from arXiv: 2602.06037 by Hang Xu, Haoyuan Li, Jianhua Han, JiaWang Bian, Kun Xiang, Qihang Cao, Tao Tang, Xiaodan Liang, Zihan Guo.

Figure 1: Thinking with geometry through active integration. Left: (a) Passive Fusion: Conventional MLLMs indiscriminately incorporate a global stream of geometric features, which leads to significant information redundancy and semantic-texture misalignment. (b) Active Perception (GeoThinker): Our framework shifts the paradigm by empowering the model to discern and selectively retrieve spatial cues guided by its int…
Figure 2: Comparison of geometry integration paradigms. (a) and (b) represent passive paradigms that indiscriminately incorporate geometric streams, often leading to semantic-geometry misalignment and redundant noise. In contrast, (c) GeoThinker shifts to active perception, empowering the MLLM to autonomously discern and selectively retrieve task-related geometric cues guided by internal reasoning.
Figure 3: Overview of the GeoThinker architecture. Our framework features a decoupled interaction mechanism where the VGGT is integrated via Spatial-Grounded Fusion layers. By employing Importance Gating, the model predicts a localized attention bias to dynamically modulate the injection of geometric textures. This design ensures that rich structural details are only queried when they are contextually relevant to th…
Figure 4: Visualization of Importance Gating Scores. Heatmaps illustrate that GeoThinker naturally learns to prioritize salient object boundaries and structural edges while suppressing non-informative regions like floors or walls.
Figure 5: Computational cost comparison of FLOPs and inference latency.
… series, the FLOPs difference between our 8-frame model and VG-LLM is negligible, with the SGF module accounting for less than 5% of the total FLOPs. While this proportion slightly increases on the Qwen3-VL series due to differences in hidden state dimensions, the overall efficiency remains high. Efficiency of Spatial Compression: Our 32-frame se…
Figure 6: Visualization of importance score on MindCube.
Figure 7: Visualization of importance score on VSI-Bench.
… towel and trash bin in the bathroom scene, and the backpack and computer mouse in the office setting. Spatial Reasoning via Landmark Identification. The visualization demonstrates that the model’s spatial reasoning is grounded in precise object localization. In the office example, where the backpack is partially obscured or located among numerous similar de…
Figure 8: Visualization of robustness to image resolution. The left panels show the importance score heatmaps, while the right panels provide a masked visualization where only regions with a heatmap value greater than 0.5 are preserved. The experiment evaluates model performance across varying input quality, from original resolution down to 6.25%.
L.3. LLM usage. We thank the Gemini 2.5-Flash for assistance in editin…
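The masked visualization described for Figure 8 (keep only regions whose heatmap value exceeds 0.5) can be sketched as a simple threshold. This is an illustrative assumption about the procedure, not the authors' plotting code. The function name `mask_by_importance`, the fill value, and the toy arrays are hypothetical.

```python
def mask_by_importance(image, heatmap, threshold=0.5, fill=0):
    """Keep pixels whose importance score is strictly above `threshold`."""
    if len(image) != len(heatmap):
        raise ValueError("image and heatmap must have the same number of rows")
    masked = []
    for img_row, heat_row in zip(image, heatmap):
        if len(img_row) != len(heat_row):
            raise ValueError("image and heatmap rows must have the same width")
        masked.append([
            px if score > threshold else fill
            for px, score in zip(img_row, heat_row)
        ])
    return masked

# Toy 2x2 grayscale image and a made-up importance heatmap.
image = [[10, 20], [30, 40]]
heatmap = [[0.9, 0.5], [0.2, 0.51]]
masked = mask_by_importance(image, heatmap)
print(masked)  # [[10, 0], [0, 40]]
```

The comparison is strict, so a score of exactly 0.5 is suppressed, matching the caption's "greater than 0.5" wording.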
Original abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes GeoThinker, a framework that shifts from passive global fusion of geometric priors in MLLMs to active selective retrieval for spatial reasoning. It applies Spatial-Grounded Fusion at selected VLM layers using frame-strict cross-attention conditioned on semantic priors, calibrated by importance gating to bias toward task-relevant structures, and reports a new SOTA peak score of 72.6 on VSI-Bench along with improved generalization on embodied referring and autonomous driving tasks.

Significance. If the performance gains are shown to be causally attributable to the active integration mechanisms rather than confounding factors, the work would offer a concrete architectural advance in reducing semantic-geometry misalignment, with potential impact on downstream spatial tasks in vision-language models.

major comments (2)
  1. [§4 (Experimental Evaluation)] The manuscript asserts SOTA performance of 72.6 on VSI-Bench and robust generalization but supplies no quantitative baseline comparisons, component ablations, error analysis, or dataset statistics; this leaves the central claim that frame-strict cross-attention plus importance gating drives the improvement without direct supporting evidence.
  2. [§3.2 (Spatial-Grounded Fusion)] No controlled ablation is reported that removes or relaxes the frame-strict constraint and importance gating while holding layer selection, parameter count, and training fixed; without these isolations the attribution of the 72.6 score to the selective-retrieval design remains unverified.
minor comments (1)
  1. [Abstract and §3] The abstract and method descriptions use terms such as 'carefully selected VLM layers' without specifying the selection criterion or providing a diagram of the layer placement; adding this detail would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below and will strengthen the experimental section with additional analyses in the revised manuscript.

Point-by-point responses
  1. Referee: [§4 (Experimental Evaluation)] The manuscript asserts SOTA performance of 72.6 on VSI-Bench and robust generalization but supplies no quantitative baseline comparisons, component ablations, error analysis, or dataset statistics; this leaves the central claim that frame-strict cross-attention plus importance gating drives the improvement without direct supporting evidence.

    Authors: We agree that the current presentation of results would benefit from more explicit quantitative support. While the manuscript reports the 72.6 peak score on VSI-Bench together with generalization to embodied referring and autonomous driving tasks, we acknowledge the absence of detailed baseline tables, component-wise ablations, error breakdowns, and dataset statistics. In the revision we will add these elements, including direct comparisons against passive global-fusion baselines and quantitative isolation of performance gains attributable to the proposed mechanisms. revision: yes

  2. Referee: [§3.2 (Spatial-Grounded Fusion)] No controlled ablation is reported that removes or relaxes the frame-strict constraint and importance gating while holding layer selection, parameter count, and training fixed; without these isolations the attribution of the 72.6 score to the selective-retrieval design remains unverified.

    Authors: We will add the requested controlled ablations in the revised manuscript. These experiments will remove or relax the frame-strict constraint and the importance-gating module individually while keeping layer selection, total parameter count, and training protocol fixed, thereby providing direct evidence for the contribution of each design choice to the reported performance. revision: yes
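The controlled ablation discussed in this exchange amounts to a small two-factor grid with everything else held constant. The sketch below only enumerates such a grid. The keys and placeholder strings in `FIXED` are hypothetical and do not come from the paper or its code.

```python
from itertools import product

# Placeholders for settings held constant across runs (not the paper's values).
FIXED = {
    "fusion_layers": "same as full model",
    "parameter_budget": "matched",
    "training_protocol": "same as full model",
}

def ablation_grid():
    """Enumerate every on/off combination of the two components under test."""
    runs = []
    for frame_strict, gating in product([True, False], repeat=2):
        runs.append({
            "frame_strict": frame_strict,
            "importance_gating": gating,
            **FIXED,
        })
    return runs

runs = ablation_grid()
for run in runs:
    print(run["frame_strict"], run["importance_gating"])
```

The first run is the full model. The remaining three each relax one or both components, so any score difference could be attributed to those switches and not to layer placement, capacity, or training changes.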

Circularity Check

0 steps flagged

No circularity: architectural framework independent of fitted or self-referential quantities

full rationale

The paper describes GeoThinker as an architectural shift from passive to active geometry integration via Spatial-Grounded Fusion at selected layers, frame-strict cross-attention, and importance gating. No equations, derivations, or parameter-fitting steps are present that reduce by construction to the inputs or to self-citations. Performance on VSI-Bench is reported as empirical outcome of the proposed modules rather than a renamed fit or self-referential prediction. The central claim rests on the design of selective retrieval mechanisms, which are presented as independent architectural choices without load-bearing self-citation chains or uniqueness theorems imported from prior author work. This matches the default expectation of a non-circular model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach assumes standard VLM layer access and attention primitives from prior work.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cambrian-P: Pose-Grounded Video Understanding

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

  2. GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    GeoWeaver performs token-adaptive geometric grounding on visual tokens from a multi-level bank prior to language modeling to support better spatio-temporal reasoning.

  3. SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
