GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Pith reviewed 2026-05-20 17:51 UTC · model grok-4.3
The pith
GeoWorld-VLM improves spatial reasoning in VLMs by aligning features with frozen world models without changing the language backbone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoWorld-VLM transfers geometric structure from frozen camera-conditioned video world models into VLMs by aligning post-projector image features with the world model's intermediate representations. Given an image, a prompt, and a sampled camera trajectory, the world model provides a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM; only the image encoder and multimodal projector are updated.
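To make the three-part objective concrete, here is a minimal sketch of how such terms could be combined. The source names the three signals but not their functional forms, so everything below is an assumption for illustration: cross-entropy for answer supervision, cosine distance for teacher-student alignment, mean squared error for the preservation anchor, and the weights `w_align` and `w_anchor`. It also assumes student and teacher features have already been mapped to the same dimension.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target answer under a softmax over logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_squared_error(a, b):
    """Average squared difference between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def total_loss(answer_logits, answer_index, student_feat, teacher_feat,
               original_feat, w_align=1.0, w_anchor=1.0):
    """Hypothetical combination of the three training signals.

    student_feat:  post-projector feature of the VLM being fine-tuned
    teacher_feat:  intermediate feature from the frozen world model
    original_feat: post-projector feature of the unmodified VLM
    """
    l_answer = cross_entropy(answer_logits, answer_index)      # spatial answer supervision
    l_align = cosine_distance(student_feat, teacher_feat)      # teacher-student alignment
    l_anchor = mean_squared_error(student_feat, original_feat) # preservation anchor
    return l_answer + w_align * l_align + w_anchor * l_anchor
```

Setting `w_align=0.0` in this sketch gives the variant with answer supervision and the anchor only, which is the control the referee report asks for.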
What carries the argument
Feature alignment between the VLM's post-projector features and the intermediate representations of the frozen world model, which uses a sampled camera trajectory to turn a static image into a multi-view spatial signal.
If this is right
- Improved spatial judgments result from enhanced visual representations rather than changes to language processing.
- The method improves two distinct VLM architectures, which suggests it generalizes across backbones.
- Original linguistic capabilities are preserved because the language model stays frozen.
- Gains of approximately 4 percent appear on both the What'sUp and VSR spatial reasoning benchmarks.
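The attribution in the list above depends on which parameters actually receive updates. As a minimal sketch, the split between trainable and frozen parameters can be expressed as a filter on parameter names. The prefixes `image_encoder.`, `projector.`, and `language_model.` are hypothetical; real VLM checkpoints use their own naming.

```python
def split_trainable(param_names, trainable_prefixes=("image_encoder.", "projector.")):
    """Partition parameter names into (trainable, frozen) lists by name prefix.

    The default prefixes are illustrative assumptions, not names from the paper.
    """
    trainable = [n for n in param_names if n.startswith(trainable_prefixes)]
    frozen = [n for n in param_names if not n.startswith(trainable_prefixes)]
    return trainable, frozen

# Hypothetical parameter names for a small example.
names = [
    "image_encoder.blocks.0.weight",
    "projector.linear.weight",
    "language_model.layers.0.attn.weight",
]
trainable, frozen = split_trainable(names)
```

In a real training loop, the `frozen` names would be the ones excluded from the optimizer, so any change in spatial accuracy can only come through the visual pathway.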
Where Pith is reading between the lines
- World models could be used as geometric teachers for other tasks requiring 3D understanding beyond spatial relations in images.
- Combining this with more advanced or larger world models might lead to further improvements in VLM spatial performance.
- This suggests a broader strategy of using specialized models to inject missing structures into general-purpose models without full retraining.
Load-bearing premise
The intermediate representations from the frozen world models contain geometric structure that can be transferred to directly cause better spatial reasoning in the VLM.
What would settle it
If an ablation that removes the feature alignment with world-model representations, while keeping the other training components, still retains the performance gains on spatial benchmarks, the central claim would be falsified. Gains that vanish under this ablation would instead support the claim.
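One way to read such an ablation is to ask what share of the total gain disappears when the alignment term is removed. The sketch below is an illustration only: the function name, the variant labels, and the accuracy values in the example are made up and are not results from the paper.

```python
def alignment_share(acc_baseline, acc_full, acc_no_align):
    """Fraction of the full method's gain that is lost when alignment is removed.

    acc_baseline: accuracy of the unmodified VLM
    acc_full:     accuracy with all three training signals
    acc_no_align: accuracy with answer supervision and anchor only
    Returns None when the full method shows no gain to attribute.
    """
    total_gain = acc_full - acc_baseline
    if total_gain <= 0:
        return None
    return (acc_full - acc_no_align) / total_gain

# Made-up numbers: baseline 60.0, full method 64.0, no-alignment variant 63.0.
share = alignment_share(60.0, 64.0, 63.0)
```

A share near 1 would mean the gain depends on the world-model alignment; a share near 0 would mean spatial fine-tuning alone explains it, which is the falsifying outcome.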
Original abstract
Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeoWorld-VLM, a VLM-side distillation method that transfers geometric structure from frozen camera-conditioned video world models into the image encoder and multimodal projector of VLMs. Only these visual components are fine-tuned via a combination of spatial answer supervision, teacher-student feature alignment to world-model intermediates, and a preservation anchor to the original VLM, while the language model remains frozen. The authors apply the approach to two distinct VLM architectures and report consistent ~4% gains on the What'sUp and VSR spatial reasoning benchmarks.
Significance. If the gains are shown to arise specifically from transferable 3D structure in the world-model representations rather than incidental effects of spatial fine-tuning, the framework could provide an efficient route to bolstering spatial capabilities in existing VLMs without retraining the language backbone or sacrificing semantic performance. The consistency across two backbones and the use of external frozen world models as geometric teachers are notable strengths of the empirical design.
major comments (2)
- [Experiments] Experiments section: the manuscript reports ~4% gains on What'sUp and VSR but provides no ablation that removes the feature-alignment term to the world-model intermediates while retaining spatial answer supervision and the preservation anchor. This control is required to establish that the observed improvements are caused by geometric transfer rather than generic adaptation from the spatial loss alone.
- [Method] Method and Evaluation sections: no information is given on the collection protocol for spatial answers, the choice of baselines, or any statistical significance testing for the reported improvements, leaving the reliability and magnitude of the 4% gains difficult to evaluate.
minor comments (1)
- [Abstract] Abstract: the phrase 'approximately 4 percent' should be accompanied by the precise metric (e.g., accuracy delta) and the identity of the baseline model for each benchmark.
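The ambiguity behind this comment is easy to show with arithmetic: "4 percent" can mean an absolute change in accuracy points or a change relative to the baseline. The helper below is a hypothetical illustration, and the accuracy values are made up rather than taken from the paper.

```python
def accuracy_deltas(baseline_acc, improved_acc):
    """Return (absolute change in accuracy points, relative change in percent)."""
    absolute_points = improved_acc - baseline_acc
    relative_percent = 100.0 * absolute_points / baseline_acc
    return absolute_points, relative_percent

# Made-up accuracies: the same "4 percent" headline fits two different results.
relative_reading = accuracy_deltas(50.0, 52.0)  # 2 points absolute, 4% relative
absolute_reading = accuracy_deltas(50.0, 54.0)  # 4 points absolute, 8% relative
```

Reporting both the baseline accuracy and the absolute delta for each benchmark removes the ambiguity.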
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that the suggested ablation and additional methodological details will strengthen the paper and will incorporate them in the revised manuscript.
- Referee: [Experiments] Experiments section: the manuscript reports ~4% gains on What'sUp and VSR but provides no ablation that removes the feature-alignment term to the world-model intermediates while retaining spatial answer supervision and the preservation anchor. This control is required to establish that the observed improvements are caused by geometric transfer rather than generic adaptation from the spatial loss alone.
  Authors: We agree that this ablation is important to isolate the contribution of geometric transfer. In the revised manuscript we will add a control experiment that retains spatial answer supervision and the preservation anchor but removes the feature-alignment term to world-model intermediates. Performance of this variant will be reported alongside the full GeoWorld-VLM results on both benchmarks and both architectures to demonstrate that the observed gains require the world-model alignment.
  Revision: yes
- Referee: [Method] Method and Evaluation sections: no information is given on the collection protocol for spatial answers, the choice of baselines, or any statistical significance testing for the reported improvements, leaving the reliability and magnitude of the 4% gains difficult to evaluate.
  Authors: We will expand the Method and Evaluation sections to address these points. Spatial answers are taken directly from the ground-truth annotations of the What'sUp and VSR datasets; we will describe the exact prompt templates and answer formats used. Baselines consist of the two unmodified original VLM architectures plus relevant prior spatial-reasoning methods. We will also report statistical significance by including standard deviations across multiple random seeds and, where appropriate, paired statistical tests to support the reliability of the reported improvements.
  Revision: yes
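As an illustration of the seed-level reporting promised in the second response, the sketch below computes the mean and standard deviation of paired accuracy differences and an exact two-sided sign-flip permutation p-value. The rebuttal does not say which paired test would be used, so the sign-flip test is a choice made here for illustration, and the scores in the example are made up.

```python
import itertools
import statistics

def paired_sign_flip_test(baseline_scores, method_scores):
    """Exact two-sided sign-flip permutation test on paired per-seed scores.

    Returns (mean difference, sample std of differences, p-value).
    Requires at least two seeds; enumerates all 2**n sign assignments.
    """
    diffs = [m - b for b, m in zip(baseline_scores, method_scores)]
    observed = abs(statistics.mean(diffs))
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped_mean = statistics.mean([s * d for s, d in zip(signs, diffs)])
        # Small tolerance so ties with the observed statistic are counted.
        if abs(flipped_mean) >= observed - 1e-12:
            extreme += 1
    return statistics.mean(diffs), statistics.stdev(diffs), extreme / total

# Made-up per-seed accuracies for three seeds.
mean_diff, std_diff, p_value = paired_sign_flip_test(
    [60.0, 61.0, 59.0], [64.0, 65.0, 63.0]
)
```

With n seeds the smallest two-sided p-value this test can produce is 2 / 2**n, so three seeds cannot go below 0.25 even when every seed improves by the same amount. That is a practical reason to run more seeds before claiming significance.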
Circularity Check
No derivation chain or self-referential reductions present
The paper presents an empirical VLM fine-tuning framework that aligns image features with intermediate representations from externally frozen camera-conditioned world models. No equations, derivations, fitted parameters, or predictions are defined or claimed anywhere in the abstract or method description. Improvements on What'sUp and VSR are reported as observed benchmark outcomes after training, not as quantities derived from the method's own inputs by construction. The approach relies on external pre-trained models and standard supervision signals rather than any self-citation chain, ansatz smuggling, or renaming of known results, rendering the reported gains independent of internal circular reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Camera-conditioned video world models produce intermediate representations that encode transferable 3D spatial structure from single images plus a sampled trajectory.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A. More Details about LingBot-World-Fast
GeoWorld-VLM uses LingBot-World-Fast as the default world-model teacher. LingBot-World-Fast is the efficient variant of LingBot-World, an open-source video-based world simulat...