pith. machine review for the scientific record.

arxiv: 2604.13321 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Why MLLMs Struggle to Determine Object Orientations


Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · object orientation · visual encoders · linear probing · geometric reasoning · CLIP · SigLIP · LLaVA

The pith

Orientation details are recoverable from MLLM visual encoder embeddings via linear models, showing encoders are not the cause of reasoning failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the idea that MLLMs fail at 2D object orientation tasks because encoders like CLIP and SigLIP, trained for semantic alignment, discard geometric information. Experiments train linear regressors on embeddings from models such as LLaVA OneVision and Qwen2.5-VL, using both full images and rotated foreground patches, and find that rotation angles can be predicted accurately. Since the information is present, attention shifts to how the language model accesses or combines it during generation. A reader would care because this reframes the problem from missing signals to underutilized ones in multimodal systems.

Core claim

Contrary to the hypothesis that visual encoders fail to preserve orientation, simple linear models recover object rotation angles from SigLIP, ViT, and CLIP embeddings with high accuracy. The information exists in the representations used by the LLaVA and Qwen models, but it is spread across tens of thousands of features, which may prevent effective exploitation by the full MLLM during inference.

What carries the argument

Linear regressors that map encoder feature vectors to predicted object orientation angles, applied to full images or foreground patches.
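
The exact probe configuration (regularizer, angle encoding, train/test split) is not spelled out on this page, so the following is only a minimal sketch of that kind of probe: a ridge regression fit on frozen encoder embeddings, with the rotation angle encoded as (sin, cos) to avoid the 0°/360° wrap-around. The function name and hyperparameters are illustrative assumptions, not the paper's recipe.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def fit_orientation_probe(embeddings: np.ndarray, angles_deg: np.ndarray):
    """Fit a linear probe from frozen encoder features (N, D) to rotation angles (N,)."""
    rad = np.deg2rad(angles_deg)
    targets = np.stack([np.sin(rad), np.cos(rad)], axis=1)   # circular-safe target
    X_tr, X_te, y_tr, y_te, a_tr, a_te = train_test_split(
        embeddings, targets, angles_deg, test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    pred = probe.predict(X_te)                                # (M, 2) predicted (sin, cos)
    pred_deg = np.rad2deg(np.arctan2(pred[:, 0], pred[:, 1])) % 360
    mae = np.abs((pred_deg - a_te + 180) % 360 - 180).mean()  # circular MAE in degrees
    return probe, mae
```

On this reading, a low error from such a probe is evidence that orientation is linearly decodable from the embeddings, which is the sense in which the paper says the information exists.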

Load-bearing premise

That accurate linear prediction from embeddings means the MLLM can locate and apply this orientation information when answering queries.

What would settle it

An auxiliary training run that adds an orientation prediction loss on the encoder outputs yet shows no gain in MLLM accuracy on orientation queries would indicate the information remains inaccessible in practice.
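
A minimal sketch of such a run, assuming a small orientation head attached to pooled visual-encoder features and an MSE loss added to the usual language-modeling loss; the head design, pooling, and loss weight lambda_orient are illustrative assumptions, not the authors' procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrientationHead(nn.Module):
    """Auxiliary head predicting (sin, cos) of object rotation from encoder features."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        pooled = vision_feats.mean(dim=1)   # (B, T, D) -> (B, D), mean over visual tokens
        return self.proj(pooled)

def combined_loss(lm_loss: torch.Tensor, vision_feats: torch.Tensor,
                  angles_deg: torch.Tensor, head: OrientationHead,
                  lambda_orient: float = 0.1) -> torch.Tensor:
    rad = torch.deg2rad(angles_deg)
    target = torch.stack([torch.sin(rad), torch.cos(rad)], dim=1)
    orient_loss = F.mse_loss(head(vision_feats), target)
    return lm_loss + lambda_orient * orient_loss
```

If MLLM accuracy on orientation queries stayed flat even as the auxiliary loss drove the head's error down, that would support the reading that the bottleneck is access during generation rather than encoding.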

Figures

Figures reproduced from arXiv: 2604.13321 by Anju Gopinath, Bruce Draper, Nikhil Krishnaswamy.

Figure 1. Set of images: Sets A and B are used for experiments with LLaVA-OV and Qwen2.5-VL-7B-Instruct, and set C is used for …
Figure 2. Collage of every 15th image from Sections A (a) dog …
Figure 3. 2D orientation estimation performance comparison of …
Figure 5. Statistical Analysis using visual plots for Qwen2.5-VL …
Figure 6. Incremental feature substitution for LLaVA-OneVision on images with the dog scene. On the y axis, when y = 1, predicted …
Figure 7. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the dog scene. On the y axis, when y = 1, predicted …
Figure 8. Collage of every 20th image from the images with the …
Figure 11. 2D orientation estimation performance comparison …
Figure 12. 2D orientation estimation performance comparison …
Figure 13. 2D orientation estimation performance comparison …
Figure 16. 2D orientation estimation performance comparison …
Figure 15. 2D orientation estimation performance comparison …
Figure 20. 2D orientation estimation performance comparison …
Figure 21. 2D orientation estimation performance comparison …
Figure 19. 2D orientation estimation performance comparison …
Figure 23. 2D orientation estimation performance comparison …
Figure 24. 2D orientation estimation performance comparison …
Figure 25. 2D orientation estimation performance comparison …
Figure 26. Statistical Analysis using visual plots for LLaVA …
Figure 27. Statistical Analysis using visual plots for Qwen2.5-…
Figure 28. Statistical Analysis using visual plots for LLaVA …
Figure 29. Statistical Analysis using visual plots for Qwen2.5-…
Figure 30. Statistical Analysis using visual plots for LLaVA …
Figure 31. Statistical Analysis using visual plots for Qwen2.5-…
Figure 36. Statistical Analysis using visual plots for LLaVA …
Figure 35. Statistical Analysis using visual plots for Qwen2.5-…
Figure 38. Statistical Analysis using visual plots for LLaVA …
Figure 39. Statistical Analysis using visual plots for Qwen2.5-…
Figure 44. Statistical Analysis using visual plots for LLaVA 1.6 …
Figure 43. Statistical Analysis using visual plots for LLaVA 1.5 …
Figure 48. Statistical Analysis using visual plots for LLaVA 1.5 …
Figure 47. Statistical Analysis using visual plots for LLaVA 1.5 …
Figure 50. Statistical Analysis using visual plots for LLaVA 1.6 …
Figure 51. Statistical Analysis using visual plots for LLaVA 1.6 …
Figure 54. Statistical Analysis using visual plots for LLaVA 1.5 …
Figure 55. Statistical Analysis using visual plots for LLaVA 1.5 …
Figure 58. Statistical Analysis using visual plots for LLaVA 1.6 …
Figure 59. Incremental feature substitution for LLaVA-OneVision on images with the lizard scene. No matter how the features are …
Figure 60. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the lizard scene. No matter how the features are …
Figure 62. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the train scene. No matter how the features are …
Figure 63. Incremental feature substitution for LLaVA-OneVision on images with the beach scene. No matter how the features are …
Figure 64. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the beach scene. No matter how the features are …
Figure 65. Incremental feature substitution for LLaVA-OneVision on images with the indoor scene. No matter how the features are …
Figure 66. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the indoor scene. No matter how the features are …
Figure 67. Incremental feature substitution for LLaVA-OneVision on images with the fish scene. No matter how the features are selected …
Figure 68. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the fish scene. No matter how the features are …
Figure 69. Incremental feature substitution for LLaVA-OneVision on images with the koala-beach scene. No matter how the features are …
Figure 70. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the koala-beach scene. No matter how the features …
Figure 71. Incremental feature substitution for LLaVA-OneVision on images with the vase-indoor scene. No matter how the features are …
Figure 72. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the vase-indoor scene. No matter how the features …
Figure 73. Incremental feature substitution for LLaVA-OneVision on images with the vase-toaster-indoor scene. No matter how the features …
Figure 74. Incremental feature substitution for Qwen2.5-VL-7B-Instruct on images with the vase-toaster-indoor scene. No matter how …
Figure 75. Incremental feature substitution for LLaVA 1.5 on images with the beach background scene. No matter how the features are …
Figure 76. Incremental feature substitution for LLaVA 1.6 on images with the beach background scene. No matter how the features are …
Figure 77. Incremental feature substitution for LLaVA 1.5 on images with the fish background scene. No matter how the features are …
Figure 78. Incremental feature substitution for LLaVA 1.6 on images with the fish background scene. No matter how the features are …
Figure 79. Incremental feature substitution for LLaVA 1.5 on images with the indoor background scene. No matter how the features are …
Figure 80. Incremental feature substitution for LLaVA 1.6 on images with the indoor background scene. No matter how the features are …
Figure 82. Images with synthetic backgrounds used to test the …
Figure 83. Results of foreground orientation estimation by LLaVA …
Figure 81. Results of foreground orientation estimation by LLaVA …
Figure 85. Results of foreground orientation estimation by LLaVA …
Figure 87. Results of foreground orientation estimation by LLaVA …
original abstract

Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings, and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don't know why they fail. Although a full explanation is beyond the scope of this paper, we show that, although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be why MLLMs fail to exploit the available orientation information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript tests the hypothesis that MLLM failures on 2D object orientation tasks originate in visual encoders (e.g., SigLIP, ViT, CLIP) failing to preserve geometric information. Using linear regression probes on embeddings from full images (LLaVA OneVision, Qwen2.5-VL) and rotated foreground patches on natural backgrounds (LLaVA 1.5/1.6), the authors reject the null that orientation is not recoverable, showing accurate prediction from encoder features. They note the information is present but diffuse across many dimensions and do not claim this resolves MLLM inference failures.

Significance. If the results hold, the work is significant for providing a controlled empirical refutation of a common hypothesis about MLLM geometric reasoning limits. The use of standard linear probes as an information-presence test, combined with separate full-image and patch-based setups, offers a direct, falsifiable check against the encoder-origin claim. Credit is due for the reproducible probe design and the explicit acknowledgment that presence of information does not imply exploitability by the full model. This shifts attention to decoder or training factors without overclaiming.

minor comments (2)
  1. [Abstract] The description of experimental controls (e.g., exact rotation ranges, background selection criteria, and error metrics such as MAE or R²) is incomplete, making it harder to assess the strength of the linear prediction results without the full methods section.
  2. The discussion of diffuse information across tens of thousands of features would benefit from a brief quantitative illustration (e.g., how many top dimensions are needed for a given accuracy threshold) to ground the observation that the signal is not localized; see the sketch after this list.
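
One way to produce such an illustration, assuming a fitted probe with (sin, cos) outputs as in the earlier sketch; the grid of retained-dimension counts and the mean-substitution choice are arbitrary and purely illustrative.

```python
import numpy as np

def mae_vs_topk(probe, X_test: np.ndarray, angles_deg: np.ndarray,
                ks=(64, 256, 1024, 4096, 16384)):
    """Circular MAE when only the k highest-|weight| probe dimensions are kept."""
    importance = np.abs(probe.coef_).sum(axis=0)     # probe.coef_ has shape (2, D)
    order = np.argsort(importance)[::-1]
    col_means = X_test.mean(axis=0)
    results = {}
    for k in ks:
        keep = np.zeros(X_test.shape[1], dtype=bool)
        keep[order[:k]] = True
        X_masked = np.where(keep, X_test, col_means)  # dropped dims replaced by their mean
        pred = probe.predict(X_masked)
        pred_deg = np.rad2deg(np.arctan2(pred[:, 0], pred[:, 1])) % 360
        results[k] = np.abs((pred_deg - angles_deg + 180) % 360 - 180).mean()
    return results
```

A curve that only flattens near the full dimensionality would concretely support the claim that the orientation signal is spread across tens of thousands of features rather than localized in a few.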

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their supportive review and recommendation of minor revision. We appreciate the accurate summary of our controlled empirical protocol, the recognition of its falsifiability, and the acknowledgment that our results shift attention to decoder or training factors without overclaiming. The positive assessment of the reproducible probe design and explicit caveats is encouraging.

Circularity Check

0 steps flagged

No significant circularity; empirical refutation of external hypothesis

full rationale

The paper's central result is an empirical test: linear regressors are trained on encoder embeddings to recover object orientation angles, directly rejecting the null hypothesis (drawn from Tong et al. and Nichols et al.) that such information is absent from CLIP/SigLIP/ViT features. No equation or claim reduces to a fitted parameter renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the diffuse-information observation is presented as an open question rather than a derivation. The protocol is self-contained against the stated external hypothesis and does not rely on internal self-definition or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard assumption that linear probes suffice to detect linearly accessible information in embeddings; no free parameters are fitted to support the main conclusion, and no new entities are introduced.

axioms (1)
  • domain assumption: Linear probes can extract information that is linearly present in high-dimensional embeddings.
    Invoked when training regressors to predict orientation from encoder features; this is a standard assumption in representation learning literature.

pith-pipeline@v0.9.0 · 5588 in / 1257 out tokens · 78192 ms · 2026-05-10T15:16:10.412411+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 22 canonical work pages · 6 internal anchors

  1. [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  2. [2] Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, and Carsten Maple. Representation engineering for large-language models: Survey and research challenges. arXiv preprint arXiv:2502.17601, 2025.

  3. [3] Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025.

  4. [4] Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? An attention mechanism perspective on focus areas. In Forty-second International Conference on Machine Learning, 2025.

  5. [5] Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, and Tamar Rott Shaham. The dual mechanisms of spatial reasoning in vision–language models. In The First Workshop on Efficient Spatial Reasoning.

  6. [6] KIM Daehyun and Hyounghun Kim. Aligning vision-language models with human directional reference.

  7. [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  8. [8] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Image orientation estimation with convolutional networks. In German Conference on Pattern Recognition, pages 368–378. Springer, 2015.

  9. [9] Sadaf Ghaffari and Nikhil Krishnaswamy. Large language models are challenged by habitat-centered reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13047–13059, 2024.

  10. [10] Farrokh Habibzadeh. Data distribution: normal or abnormal? Journal of Korean Medical Science, 39(3), 2024.

  11. [11] Ngoc Dung Huynh, Yasser Dahou, Phuc H Le-Khac, Wamiq Reyaz Para, Ankit Singh, and Sanath Narayan. Vision-language models can't see the obvious. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24159–24169, 2025.

  12. [12] Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135.

  13. [13] Ji Hyeok Jung, Eun Tae Kim, Seoyeon Kim, Joo Ho Lee, Bumsoo Kim, and Buru Chang. Is 'right' right? Enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14257–14267, 2025.

  14. [14] Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, and Rui Zhang. VisOnlyQA: Large vision language models still struggle with visual perception of geometric information. arXiv preprint arXiv:2412.00947, 2024.

  15. [15] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  16. [16] Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, and Kai Chen. Euclid's gift: Enhancing spatial perception and reasoning in vision-language models via geometric surrogate tasks. arXiv preprint arXiv:2509.24473, 2025.

  17. [17] Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin. Spatial intelligence in vision-language models: A comprehensive survey. 2026.

  18. [18] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

  19. [19] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

  20. [20] Dongchen Lu, Dongmei Li, Yali Li, and Shengjin Wang. OSKDet: Orientation-sensitive keypoint localization for rotated object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1192, 2022.

  21. [21] Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu, Jiayang Ao, Xingjun Ma, Sarah Monazam Erfani, and James Bailey. Attention in space: Functional roles of VLM heads for spatial reasoning. arXiv preprint arXiv:2603.20662, 2026.

  22. [22] Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.

  23. [23] Prabhaker Mishra, Chandra M Pandey, Uttam Singh, Anshul Gupta, Chinmoy Sahu, and Amit Keshri. Descriptive statistics and normality tests for statistical data. Annals of Cardiac Anaesthesia, 22(1):67–72, 2019.

  24. [24] Keanu Nichols, Nazia Tasnim, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, and Bryan A Plummer. Right side up? Disentangling orientation understanding in MLLMs with fine-grained multi-axis perception tasks. arXiv preprint arXiv:2505.21649, 2025.

  25. [25] Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, and Mohit Bansal. RotBench: Evaluating multimodal large language models on identifying image rotation. arXiv preprint arXiv:2508.13968, 2025.

  26. [26] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.

  27. [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  28. [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

  29. [29] Fatemeh Shiri, Xiao-Yu Guo, Mona Far, Xin Yu, Reza Haf, and Yuan-Fang Li. An empirical analysis on spatial reasoning capabilities of large multimodal models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21440–21455, 2024.

  30. [30] Jie Sun, Wengang Zhou, and Houqiang Li. Orientation estimation network. In International Conference on Image and Graphics, pages 151–162. Springer, 2017.

  31. [31] Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, and Ang Li. Why representation engineering works: A theoretical and empirical study in vision-language models. arXiv preprint arXiv:2503.22720, 2025.

  32. [32] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

  33. [33] Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Kun Shao, Zheng Tian, Haifeng Zhang, and Jun Wang. SpatialViz-Bench: An MLLM benchmark for spatial visualization. arXiv preprint arXiv:2507.07610, 2025.

  34. [34] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025.

  35. [35] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

  36. [36] Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979, 2025.

  37. [37] Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, et al. How far are VLMs from visual spatial intelligence? A benchmark-driven perspective. arXiv preprint arXiv:2509.18905, 2025.

  38. [38] Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. DepthVLA: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375, 2025.

  39. [39] Jessica Yung, Rob Romijnders, Alexander Kolesnikov, Lucas Beyer, Josip Djolonga, Neil Houlsby, Sylvain Gelly, Mario Lucic, and Xiaohua Zhai. SI-Score: An image dataset for fine-grained analysis of robustness to object location, rotation and size. arXiv preprint arXiv:2104.04191, 2021.

  40. [40] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.

  41. [41] Quan-shi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.

  42. [42] Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, and Jiajun Zhang. Why do MLLMs struggle with spatial understanding? A systematic analysis from data to architecture. arXiv preprint arXiv:2509.02359, 2025.

  43. [43] Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, et al. Embodied-Reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks. arXiv preprint arXiv:2503.21696, 2025.

  44. [44] Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks. arXiv preprint arXiv:2510.25760, 2025.

  45. [45] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
