pith. machine review for the scientific record.

arxiv: 2604.01848 · v2 · submitted 2026-04-02 · 💻 cs.CV

Recognition: 2 theorem links


Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · geometric transformations · spatial invariance · visual equivariance · semantic understanding · model evaluation · geometric reasoning

The pith

Vision-language models fail to maintain object identity under basic rotations, scaling, and translations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that current vision-language models, strong on semantic recognition in standard views, lose the ability to identify the same object after simple geometric changes. Tests across sketches, photographs, and art reveal consistent drops in performance, especially when images contain little semantic detail. These issues appear across different model sizes and prompting methods, indicating the problem is not fixed by scale or instructions alone. A reader would care because it suggests VLMs depend on semantic shortcuts rather than building stable representations of shape and position.

Core claim

State-of-the-art VLMs exhibit systematic failures at a fundamental level: insufficient spatial invariance and equivariance to reliably determine object identity under rotations, scaling, and translations. Performance drops sharply as semantic content becomes sparse, and this pattern holds across symbolic sketches, natural photographs, abstract art, model architectures, capacities, and prompting strategies.

What carries the argument

Systematic evaluation of VLMs on controlled geometric transformations across multiple visual domains to measure drops in object identity accuracy.
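The pairwise protocol can be sketched in a few lines. The following is a hypothetical illustration, not the authors' released code: a "model" is any yes/no predicate `same_object(a, b)`, and here an exact-pixel-match stub stands in for the VLM's answer. Like the models under test, the stub handles the identity transform trivially but breaks under rotation, which is the fragility the paper measures.

```python
# Hypothetical sketch of the paper's pairwise evaluation protocol.
# A "model" is any predicate same_object(img_a, img_b); the stub below
# uses exact pixel comparison, which fails on rotated positives.

def rotate90(grid):
    """Rotate a binary character grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def stub_same_object(a, b):
    return a == b  # placeholder for a VLM's yes/no answer

def evaluate(chars, same_object):
    """TPR on positive pairs (image vs. its rotation) and TNR on
    negative pairs (distinct characters), as in the paper's tables."""
    tp = sum(same_object(c, rotate90(c)) for c in chars)
    tn = sum(not same_object(chars[i], chars[j])
             for i in range(len(chars))
             for j in range(len(chars)) if i != j)
    n_neg = len(chars) * (len(chars) - 1)
    return tp / len(chars), tn / n_neg

# Toy 2x2 "characters" standing in for Omniglot-style glyphs.
chars = [
    [[1, 0], [1, 1]],
    [[0, 1], [0, 0]],
]
tpr, tnr = evaluate(chars, stub_same_object)  # stub: tpr 0.0, tnr 1.0
```

The stub's perfect TNR alongside zero TPR mirrors the asymmetry in the reported results: rejecting mismatched pairs is easy without geometric reasoning, while recognizing a rotated positive is not.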

If this is right

  • VLMs display a gap between semantic understanding and spatial reasoning that persists regardless of architecture or model size.
  • Semantic sparsity causes sharper accuracy losses when images undergo geometric changes.
  • Current prompting strategies do not mitigate the lack of invariance and equivariance.
  • Future multimodal systems require stronger geometric grounding to close this gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could incorporate explicit geometric consistency objectives during training to reduce reliance on semantic cues.
  • Tasks like robotic manipulation or scene navigation may require auxiliary checks beyond standard VLM outputs.
  • The same fragility pattern could appear in other vision-language benchmarks that assume stable object representations.

Load-bearing premise

The observed performance drops under geometric transformations directly reflect missing geometric reasoning rather than artifacts from prompting, dataset biases, or evaluation metrics.

What would settle it

A controlled test showing that multiple VLMs maintain high accuracy on object identification tasks after applying varied rotations, scalings, and translations to the same images would contradict the central claim.

Figures

Figures reproduced from arXiv: 2604.01848 by Deepti Ghadiyaram, Jason Qiu, Xavier Thomas, Zachary Meurer.

Figure 1: Failure of visual transformation reasoning across visual domains. Given a pair of images, models are asked to determine whether they depict the same object under transformations of rotation, scale, or identity. While performance remains near-perfect on natural images (Art, Photo), accuracy drops sharply on abstract and symbolic images (Symbolic and Semantic Sketches), particularly for rotation. Results sho… view at source ↗
Figure 2: Cosine similarity between features extracted from different vision encoders on pairs of images under rotation. Select Omniglot scripts are shown in orange, while Times New Roman and Handwritten English are shown in blue and purple respectively. Across all encoders, similarity decreases with increasing rotation angle, with DINOv2 showing the steepest drop and SigLIP and Qwen2.5-VL-7B maintaining relatively … view at source ↗
Figure 3: Failure cases on the identity task (Sec. 3.3.1) for Qwen2.5-VL-7B. We show four randomly selected examples from the Omniglot dataset where the model incorrectly predicts that two identical inputs correspond to different characters. [Accompanying per-dataset results table (Acc./TNR/TPR for Qwen2.5-VL-7B/32B, Qwen3-VL-8B/30B, GPT-5.2, Gemini-2.5-Pro) truncated in extraction.] view at source ↗
Figure 4: Datasets used in our evaluation. Omniglot (Lake et al., 2015) contains handwritten binary characters from 50 diverse scripts. Times New Roman (Morison & Lardent, 1932) provides standardized English characters rendered in a fixed typeface. Handwritten English Mann (2024) includes handwritten characters from the English alphabet. PACS (Yu et al., 2022) contains images of common object categories (e.g., guita… view at source ↗
Figure 5: Examples from the Omniglot dataset. view at source ↗
Figure 6: Examples from the PACS dataset. view at source ↗
Figure 7: Examples from the Times New Roman dataset. view at source ↗
Figure 8: Examples from the Handwritten English dataset. view at source ↗
Figure 9: Padding artifacts under rotation and scaling. Top row shows rotation and bottom row shows scaling. Non-90° rotations introduce padding regions corresponding to pixels not covered by the original image. Scaling similarly introduces padding for resized images. For character datasets (left), padding matches the white background and is not visually salient. In contrast, for PACS (right), padding introduces vis… view at source ↗
Figure 10: Cosine similarity between features extracted from different vision encoders on pairs of images under rotation. Select Omniglot scripts are shown in orange, while Times New Roman and Handwritten English are shown in blue and purple respectively. Across all encoders, similarity decreases with increasing rotation angle, with DINOv2 showing the steepest drop and SigLIP and Qwen2.5-VL-7B maintaining relatively… view at source ↗
Figure 11: Scale invariance performance across datasets for Qwen2.5-VL, aggregated over all scales (Sec. 3.3.2). Performance is high on natural image domains (Art Painting, Cartoon, Photo, Sketch) and familiar scripts (Times New Roman, Handwritten English), but drops on Omniglot, indicating reduced robustness under lower semantic familiarity. view at source ↗
Figure 12: Scale invariance performance across datasets for Gemini-2.5-Pro, aggregated over all scales (Sec. 3.3.2). Performance is near-perfect on natural image domains (Art Painting, Cartoon, Photo, Sketch) and familiar scripts (Times New Roman, Handwritten English), but drops on Omniglot, indicating reduced robustness under lower semantic familiarity. view at source ↗
Figure 13: Perimetric complexity vs. performance. Perimetric complexity shows a weak correlation (r = −0.18) with accuracy of Qwen2.5-VL-7B on the scale invariance task. Times New Roman and Handwritten English are shown in blue and purple respectively, while Omniglot scripts are highlighted in orange. Visual complexity alone does not account for performance differences across scripts. view at source ↗
Figure 14: Recall and Specificity at scale 0.3× for representative scripts for the scale-invariance task. English characters rendered in Times New Roman, Handwritten English characters, and Omniglot scripts are shown in blue, purple, and orange respectively, and are selected to represent high-, medium-, and low-performing groups. Across both models, familiar scripts such as Greek and Latin consistently outperform le… view at source ↗
Figure 15: Model accuracy on the scale-invariance task across scale factors. Both Qwen2.5-VL-7B and Qwen2.5-VL-32B models maintain near-perfect accuracy for both Times New Roman and Handwritten English characters across all scales, while performance on Omniglot scripts is substantially and consistently lower for both models. view at source ↗
Figure 16: In-context learning (ICL) prompting setup. The system prompt includes two labeled examples: a positive pair (top) where one image is a rotated version of the other, and a negative pair (bottom) where the images are not related by rotation. view at source ↗
Figure 17: Rotational grid prompting setup. The model is first given a structured system prompt describing the layout of rotational grids. Two reference grids (Character A and Character B) are then provided to illustrate rotations across different characters. view at source ↗
read the original abstract

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that state-of-the-art VLMs lack robust spatial invariance and equivariance, as evidenced by sharp performance drops under simple geometric transformations (rotations, scaling, identity changes) across symbolic sketches, natural photographs, and abstract art. These failures are reported to intensify when semantic content is sparse and to persist across architectures, model sizes, and prompting strategies, revealing a fundamental gap between semantic understanding and geometric reasoning in current VLMs.

Significance. If the observed performance patterns can be shown to isolate geometric reasoning deficits rather than distribution shifts or metric artifacts, the result would be significant for the field: it would provide concrete empirical motivation for incorporating explicit geometric grounding into VLM training and architectures, moving beyond reliance on semantic richness alone.

major comments (1)
  1. [Abstract] Abstract: the central interpretation—that performance drops demonstrate missing spatial invariance/equivariance at the reasoning level—depends on the unshown assumption that the chosen transformations, prompts, and success metrics isolate geometric structure from confounds such as patch-tokenization changes, feature-distribution shifts, or description variability. No quantitative controls (e.g., embedding-distance comparisons pre/post-transform or ablation on metric choice) are described, leaving the claim vulnerable to alternative explanations.
minor comments (2)
  1. [Title] Title: the possessive 'VLM's' should be 'VLMs'' to match the plural usage throughout the abstract.
  2. [Abstract] Abstract: the phrase 'identity transformations' is ambiguous; clarify whether it refers to translations, reflections, or another class of geometric operations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The concern about potential confounds is well-taken, and we address it directly below while clarifying the design choices in our evaluation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central interpretation—that performance drops demonstrate missing spatial invariance/equivariance at the reasoning level—depends on the unshown assumption that the chosen transformations, prompts, and success metrics isolate geometric structure from confounds such as patch-tokenization changes, feature-distribution shifts, or description variability. No quantitative controls (e.g., embedding-distance comparisons pre/post-transform or ablation on metric choice) are described, leaving the claim vulnerable to alternative explanations.

    Authors: We agree that explicit controls would further isolate geometric deficits from alternative explanations. Our multi-domain design (symbolic sketches, natural photos, abstract art) and consistent degradation patterns across architectures, sizes, and prompts were intended to mitigate domain-specific confounds, as tokenization or distribution shifts would be expected to vary substantially across these visual regimes rather than produce the observed uniform fragility tied to semantic sparsity. Nevertheless, we will add quantitative controls in the revision: (1) cosine-distance comparisons of image embeddings (from the vision encoder) pre- and post-transformation to quantify feature-distribution shifts, and (2) an ablation on success metrics (exact-match accuracy versus CLIP-based semantic similarity) reported in a new supplementary section. These additions will be presented alongside the main results to strengthen the geometric-reasoning interpretation. revision: partial
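The first control the rebuttal promises can be sketched as below. This is a hypothetical stand-in, not the authors' code: toy vectors replace real vision-encoder features, and in practice `emb_original` and `emb_rotated` would come from encoding an image and its rotated copy.

```python
# Hypothetical sketch of the proposed control: cosine comparison of
# embeddings before and after a geometric transform. Small drift would
# argue against a feature-distribution-shift confound; large drift
# would support it.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-ins for encoder outputs on an image and its rotated version.
emb_original = [0.8, 0.1, 0.3]
emb_rotated = [0.7, 0.2, 0.4]

sim = cosine_similarity(emb_original, emb_rotated)
drift = 1.0 - sim  # embedding drift attributable to the transform
```

Aggregating `drift` over rotation angles per encoder would reproduce the kind of curves shown in Figures 2 and 10.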

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential steps

full rationale

The paper is an empirical evaluation study that measures VLM performance drops under geometric transformations across domains, architectures, and prompting strategies. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text or abstract. All claims rest on direct observational patterns that are externally falsifiable via replication on the same models and transformations, with no reduction of results to definitions or prior author work by construction. This is the standard non-circular outcome for benchmark-style empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical and introduces no free parameters, mathematical axioms, or invented entities; the central claim rests on the assumption that the chosen transformations and domains isolate geometric reasoning deficits.

pith-pipeline@v0.9.0 · 5438 in / 977 out tokens · 28423 ms · 2026-05-13T22:21:54.861123+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    performance drops sharply as semantic content becomes sparse... models rely on semantic recognition as a shortcut rather than genuine transformation reasoning

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the apparent invariance observed in real-world images is not just because of geometric reasoning, but rather a byproduct of dataset familiarity

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
