Why MLLMs Struggle to Determine Object Orientations
Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3
The pith
Orientation details are recoverable from MLLM visual encoder embeddings via linear models, showing encoders are not the cause of reasoning failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contrary to the hypothesis that visual encoders fail to preserve orientation, simple linear models recover object rotation angles from SigLIP, ViT, and CLIP embeddings with high accuracy. The information exists in the representations from LLaVA and Qwen models but spreads across tens of thousands of features, which may prevent effective exploitation by the full MLLM during inference.
What carries the argument
Linear regressors that map encoder feature vectors to predicted object orientation angles, applied to full images or foreground patches.
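Such a probe can be sketched as follows; the dimensions, noise model, and ridge penalty are illustrative assumptions, not the paper's exact protocol. Regressing onto (sin θ, cos θ) rather than θ itself avoids the 0°/360° wrap-around:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for encoder embeddings: the true angle is linearly
# embedded (as sin/cos) in a high-dimensional feature vector plus noise.
n, d = 2000, 512
angles = rng.uniform(0, 2 * np.pi, n)
mix = rng.normal(size=(2, d))           # how sin/cos spread across features
X = np.column_stack([np.sin(angles), np.cos(angles)]) @ mix
X += 0.05 * rng.normal(size=X.shape)

# Linear (ridge) probe: features -> (sin, cos) targets, closed form.
Y = np.column_stack([np.sin(angles), np.cos(angles)])
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Decode the predicted angle and score with a circular absolute error.
pred = np.arctan2(X @ W[:, 0], X @ W[:, 1]) % (2 * np.pi)
err = np.abs((pred - angles + np.pi) % (2 * np.pi) - np.pi)
mae_deg = np.degrees(err.mean())
print(f"probe MAE: {mae_deg:.2f} deg")
```

On real encoder features, X would come from the frozen visual tower; only the linear map W is fit.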
Load-bearing premise
That accurate linear prediction from embeddings means the MLLM can locate and apply this orientation information when answering queries.
What would settle it
An auxiliary training run that adds an orientation prediction loss on the encoder outputs yet shows no gain in MLLM accuracy on orientation queries would indicate the information remains inaccessible in practice.
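A minimal sketch of that auxiliary objective, assuming a linear orientation head on the encoder outputs; the function name, the sin/cos target encoding, and the 0.1 weighting are all hypothetical, not from the paper:

```python
import numpy as np

def combined_loss(task_loss: float, enc_feats: np.ndarray,
                  head_W: np.ndarray, true_angles: np.ndarray,
                  weight: float = 0.1) -> float:
    """Task loss plus an auxiliary orientation-regression loss computed
    on encoder outputs; sin/cos targets avoid angular wrap-around."""
    targets = np.column_stack([np.sin(true_angles), np.cos(true_angles)])
    aux = float(np.mean((enc_feats @ head_W - targets) ** 2))
    return task_loss + weight * aux

# Toy check: a head that already predicts sin/cos perfectly adds no loss.
angs = np.array([0.0, np.pi / 2])
head = np.column_stack([np.sin(angs), np.cos(angs)])  # acts as enc_feats @ W
print(combined_loss(1.0, np.eye(2), head, angs))      # -> 1.0
```

If MLLM accuracy on orientation queries stayed flat even with this term driving the encoder's orientation error down, that would support the "present but inaccessible" reading.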
Original abstract
Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings, and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don't know why they fail. Although a full explanation is beyond the scope of this paper, we show that, although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be why MLLMs fail to exploit the available orientation information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript tests the hypothesis that MLLM failures on 2D object orientation tasks originate in visual encoders (e.g., SigLIP, ViT, CLIP) failing to preserve geometric information. Using linear regression probes on embeddings from full images (LLaVA OneVision, Qwen2.5-VL) and rotated foreground patches on natural backgrounds (LLaVA 1.5/1.6), the authors reject the null that orientation is not recoverable, showing accurate prediction from encoder features. They note the information is present but diffuse across many dimensions and do not claim this resolves MLLM inference failures.
Significance. If the results hold, the work is significant for providing a controlled empirical refutation of a common hypothesis about MLLM geometric reasoning limits. The use of standard linear probes as an information-presence test, combined with separate full-image and patch-based setups, offers a direct, falsifiable check against the encoder-origin claim. Credit is due for the reproducible probe design and the explicit acknowledgment that presence of information does not imply exploitability by the full model. This shifts attention to decoder or training factors without overclaiming.
minor comments (2)
- [Abstract] The description of experimental controls (e.g., exact rotation ranges, background selection criteria, and error metrics such as MAE or R²) is incomplete, making it harder to assess the strength of the linear prediction results without the full methods section.
- The discussion of diffuse information across tens of thousands of features would benefit from a brief quantitative illustration (e.g., how many top dimensions are needed for a given accuracy threshold) to ground the observation that the signal is not localized.
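One way to produce such an illustration is to rank dimensions by probe-weight magnitude and refit on the top-k only. The synthetic embeddings below, with the signal spread thinly across all dimensions, are an assumption standing in for real encoder features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy embeddings where the orientation signal is spread thinly across
# ALL dimensions rather than concentrated in a few.
n, d = 3000, 400
theta = rng.uniform(0, 2 * np.pi, n)
mix = rng.normal(size=(2, d)) / np.sqrt(d)   # thin, diffuse spread
X = np.column_stack([np.sin(theta), np.cos(theta)]) @ mix
X += 0.02 * rng.normal(size=X.shape)
Y = np.column_stack([np.sin(theta), np.cos(theta)])

def probe_mae(cols):
    """Refit a linear probe on the selected columns; circular MAE in degrees."""
    Xs = X[:, cols]
    W = np.linalg.lstsq(Xs, Y, rcond=None)[0]
    pred = np.arctan2(Xs @ W[:, 0], Xs @ W[:, 1])
    err = np.abs((pred - theta + np.pi) % (2 * np.pi) - np.pi)
    return np.degrees(err.mean())

# Rank features by full-probe weight magnitude, then measure MAE
# as only the top-k dimensions are kept.
W_full = np.linalg.lstsq(X, Y, rcond=None)[0]
rank = np.argsort(-np.abs(W_full).sum(axis=1))
for k in (10, 100, d):
    print(k, round(probe_mae(rank[:k]), 1))
```

A slow decay of MAE with k, rather than a sharp drop at small k, is exactly the "not localized" signature the comment asks to quantify.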
Simulated Author's Rebuttal
We thank the referee for their supportive review and recommendation of minor revision. We appreciate the accurate summary of our controlled empirical protocol, the recognition of its falsifiability, and the acknowledgment that our results shift attention to decoder or training factors without overclaiming. The positive assessment of the reproducible probe design and explicit caveats is encouraging.
Circularity Check
No significant circularity; empirical refutation of external hypothesis
full rationale
The paper's central result is an empirical test: linear regressors are trained on encoder embeddings to recover object orientation angles, directly rejecting the null hypothesis (drawn from Tong et al. and Nichols et al.) that such information is absent from CLIP/SigLIP/ViT features. No equation or claim reduces to a fitted parameter renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and the diffuse-information observation is presented as an open question rather than a derivation. The protocol is self-contained against the stated external hypothesis and does not rely on internal self-definition or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: linear probes can extract information that is linearly present in high-dimensional embeddings.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, and Carsten Maple. Representation engineering for large-language models: Survey and research challenges. arXiv preprint arXiv:2502.17601, 2025.
- [3] Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. Perception tokens enhance visual reasoning in multimodal language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3836–3845, 2025.
- [4] Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? An attention mechanism perspective on focus areas. In Forty-second International Conference on Machine Learning, 2025.
- [5] Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, and Tamar Rott Shaham. The dual mechanisms of spatial reasoning in vision–language models. In The First Workshop on Efficient Spatial Reasoning.
- [6] Daehyun Kim and Hyounghun Kim. Aligning vision-language models with human directional reference.
- [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [8] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Image orientation estimation with convolutional networks. In German Conference on Pattern Recognition, pages 368–378. Springer, 2015.
- [9] Sadaf Ghaffari and Nikhil Krishnaswamy. Large language models are challenged by habitat-centered reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13047–13059, 2024.
- [10] Farrokh Habibzadeh. Data distribution: normal or abnormal? Journal of Korean Medical Science, 39(3), 2024.
- [11] Ngoc Dung Huynh, Yasser Dahou, Phuc H Le-Khac, Wamiq Reyaz Para, Ankit Singh, and Sanath Narayan. Vision-language models can't see the obvious. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24159–24169, 2025.
- [12] Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135.
- [13] Ji Hyeok Jung, Eun Tae Kim, Seoyeon Kim, Joo Ho Lee, Bumsoo Kim, and Buru Chang. Is 'right' right? Enhancing object orientation understanding in multimodal large language models through egocentric instruction tuning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14257–14267, 2025.
- [14] Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, and Rui Zhang. VisOnlyQA: Large vision language models still struggle with visual perception of geometric information. arXiv preprint arXiv:2412.00947, 2024.
- [15] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [16] Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, and Kai Chen. Euclid's gift: Enhancing spatial perception and reasoning in vision-language models via geometric surrogate tasks. arXiv preprint arXiv:2509.24473, 2025.
- [17] Disheng Liu, Tuo Liang, Zhe Hu, Jierui Peng, Yiren Lu, Yi Xu, Yun Fu, and Yu Yin. Spatial intelligence in vision-language models: A comprehensive survey. 2026.
- [18] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- [19] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.
- [20] Dongchen Lu, Dongmei Li, Yali Li, and Shengjin Wang. OSKDet: Orientation-sensitive keypoint localization for rotated object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1182–1192, 2022.
- [21] Xueqi Ma, Shuo Yang, Yanbei Jiang, Shu Liu, Zhenzhen Liu, Jiayang Ao, Xingjun Ma, Sarah Monazam Erfani, and James Bailey. Attention in space: Functional roles of VLM heads for spatial reasoning. arXiv preprint arXiv:2603.20662, 2026.
- [22] Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.
- [23] Prabhaker Mishra, Chandra M Pandey, Uttam Singh, Anshul Gupta, Chinmoy Sahu, and Amit Keshri. Descriptive statistics and normality tests for statistical data. Annals of Cardiac Anaesthesia, 22(1):67–72, 2019.
- [24] Keanu Nichols, Nazia Tasnim, Yuting Yan, Nicholas Ikechukwu, Elva Zou, Deepti Ghadiyaram, and Bryan A Plummer. Right side up? Disentangling orientation understanding in MLLMs with fine-grained multi-axis perception tasks. arXiv preprint arXiv:2505.21649, 2025.
- [25] Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, and Mohit Bansal. RotBench: Evaluating multimodal large language models on identifying image rotation. arXiv preprint arXiv:2508.13968, 2025.
- [26] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- [29] Fatemeh Shiri, Xiao-Yu Guo, Mona Far, Xin Yu, Reza Haf, and Yuan-Fang Li. An empirical analysis on spatial reasoning capabilities of large multimodal models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21440–21455, 2024.
- [30] Jie Sun, Wengang Zhou, and Houqiang Li. Orientation estimation network. In International Conference on Image and Graphics, pages 151–162. Springer, 2017.
- [31] Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, and Ang Li. Why representation engineering works: A theoretical and empirical study in vision-language models. arXiv preprint arXiv:2503.22720, 2025.
- [32] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
- [33] Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Kun Shao, Zheng Tian, Haifeng Zhang, and Jun Wang. SpatialViz-Bench: An MLLM benchmark for spatial visualization. arXiv preprint arXiv:2507.07610, 2025.
- [34] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025.
- [35] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
- [36] Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979, 2025.
- [37] Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, et al. How far are VLMs from visual spatial intelligence? A benchmark-driven perspective. arXiv preprint arXiv:2509.18905, 2025.
- [38] Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. DepthVLA: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375, 2025.
- [39] Jessica Yung, Rob Romijnders, Alexander Kolesnikov, Lucas Beyer, Josip Djolonga, Neil Houlsby, Sylvain Gelly, Mario Lucic, and Xiaohua Zhai. SI-Score: An image dataset for fine-grained analysis of robustness to object location, rotation and size. arXiv preprint arXiv:2104.04191, 2021.
- [40] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [41] Quan-shi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: A survey. Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
- [42] Wanyue Zhang, Yibin Huang, Yangbin Xu, JingJing Huang, Helu Zhi, Shuo Ren, Wang Xu, and Jiajun Zhang. Why do MLLMs struggle with spatial understanding? A systematic analysis from data to architecture. arXiv preprint arXiv:2509.02359, 2025.
- [43] Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, et al. Embodied-Reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks. arXiv preprint arXiv:2503.21696, 2025.
- [44] Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks. arXiv preprint arXiv:2510.25760, 2025.
- [45] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
Why MLLMs Struggle to Determine Object Orientations — Supplementary Material
Condensed from the supplementary figure and table captions (the extracted text was fragmentary):
- Figure 8: collage of every 20th image from the set with the dog foreground (the largest foreground), used for LLaVA 1.5 and 1.6.
- Figure 9: 2D orientation estimation performance comparison between LLaVA-OneVision and Qwen2.5-VL-7B-Instruct.
- Figures 40–58: statistical-analysis plots for LLaVA-OneVision, Qwen2.5-VL-7B-Instruct, and LLaVA 1.5/1.6 across foregrounds (dog, lizard, train, vase-toaster-indoor) and scales; together with the numerical results in Table 3, they indicate that the residual error distributions (Table 2) are random/Gaussian, with one case not possible to determine.
- Figure 59: incremental feature substitution for LLaVA-OneVision on the lizard scene, with features selected (a) by model weight, (b) by value difference, or (c) randomly; the degradation pattern is similar regardless of how the features are selected.
- Rotated-background experiments fared poorly, with MAE upwards of 80°; the experiments were repeated on image sets with synthetic backgrounds of horizontal and vertical lines (Figure 82), under the hypothesis that accurate foreground orientation prediction depends on background cues.
- MAE table for synthetic-background image sets (LLaVA 1.5 and 1.6) under different rotation conditions: when the background is rotated, performance (1) degrades sharply when trained on only FG-rotated images, (2)–(3) improves significantly when trained on BG+FG-rotated images, and (4) improves moderately when trained on only BG-rotated images.