pith. machine review for the scientific record.

arxiv: 2604.17354 · v1 · submitted 2026-04-19 · 💻 cs.CL · cs.CV

Recognition: unknown

More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords vision-language models · literal superiority bias · semiotic gap · idiomatic interpretation · semantic alignment · visual abstraction · noun compounds · DIVA benchmark

The pith

Vision-language models exhibit a persistent literal superiority bias that scale does not fix and that realistic imagery makes stronger.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether high visual detail in vision-language models interferes with their grasp of abstract, idiomatic language such as figurative noun compounds. It builds DIVA, a benchmark that swaps photorealistic pictures for simple schematic icons tied directly to literal versus idiomatic senses, then measures how far model outputs diverge from each intended meaning. Results across eight recent models show the bias toward literal visuals holds steady or grows with model size and with greater image realism, pointing to interference rather than a simple data shortage. This matters because it indicates that current training approaches may be locking models into surface-level visual matching at the expense of flexible symbolic understanding.

Core claim

VLMs display a consistent Literal Superiority Bias when grounding noun compounds, favoring literal visual interpretations over idiomatic ones. The bias is quantified by the Semantic Alignment Gap (Δ), which tracks divergence between the two readings, and by a signed directional bias score b(t). Evaluations reveal that larger model scale fails to reduce the preference, while higher visual fidelity correlates with weaker symbolic alignment, which the authors interpret as cognitive interference arising from hyper-realistic imagery. The work concludes that better compositional understanding requires iconographic abstraction of visual inputs together with explicit anchoring of generation and interpretation in intended meaning.

What carries the argument

DIVA benchmark that produces paired, sense-anchored schematic visualizations for literal and idiomatic readings, together with the architecture-agnostic Semantic Alignment Gap (Δ) and directional signed bias b(t) that separately quantify divergence magnitude and preference direction.
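The review does not reproduce the paper's formulas for Δ or b(t). Under one plausible reading (an assumption for illustration, not the authors' definition), both reduce to cosine similarities between a model response and the two sense anchors, with Δ the magnitude of the gap and b(t) its signed direction:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_scores(response, literal_anchor, idiomatic_anchor):
    """Assumed form of the metrics (the review gives no formulas):
    Delta is the magnitude of the literal/idiomatic divergence and
    b(t) its signed direction (positive = literal preference)."""
    s_lit = cosine(response, literal_anchor)
    s_idio = cosine(response, idiomatic_anchor)
    return abs(s_lit - s_idio), s_lit - s_idio

# Toy 3-d embeddings: a response that sits nearer the literal anchor.
literal = [1.0, 0.0, 0.0]
idiomatic = [0.0, 1.0, 0.0]
response = [0.9, 0.2, 0.0]
delta, bias = alignment_scores(response, literal, idiomatic)
# bias > 0 in this toy setup, i.e. a literal preference
```

A corpus-level score would then average b(t) over compounds t; a positive mean would reproduce the Literal Superiority Bias pattern the paper reports.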

Load-bearing premise

That the schematic icons and sense-anchored pairs in DIVA cleanly isolate the semiotic gap without their own generation process or selection choices creating the observed literal preference.

What would settle it

A follow-up run on the same eight models that finds the literal preference vanishes or reverses when the DIVA schematic images replace photorealistic ones, or that finds no correlation between visual detail level and alignment scores.
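The second half of that settling experiment reduces to a rank correlation between per-item visual detail and alignment. A minimal sketch in plain Python, with invented numbers standing in for scores the review does not report:

```python
from math import sqrt

def _ranks(xs):
    """Rank positions (0-based), assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)

# Hypothetical per-image fidelity levels and symbolic-alignment scores.
fidelity = [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
alignment = [0.81, 0.77, 0.70, 0.62, 0.55, 0.48]
rho = spearman_rho(fidelity, alignment)
# A strongly negative rho matches the interference claim;
# rho near zero on real data would count against it.
```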

Figures

Figures reproduced from arXiv: 2604.17354 by Wei He.

Figure 1
Figure 1. Overview of the Iconographic Abstraction Framework. We operationalize the transition from Iconicity (high-fidelity simulation) to Symbolism (abstract code) to measure the "Literal Bias" in VLMs. view at source ↗
Figure 2
Figure 2. Iconographic Abstraction in Action. Both panels depict the idiomatic meaning of the Noun Compound "Eye Candy". We illustrate the transition from the high-fidelity domain (Panel a) to the visually simplified iconographic domain (Panel b), which isolates the semantic core. Note that Panel (a) exhibits higher visual fidelity (e.g., shadows, gradients, detailed objects) compared to Panel (b); we use "high v… view at source ↗
Figure 3
Figure 3. Visual comparison of the Semantic Alignment Gap (Δ). The chart illustrates the consistent reduction in Δ when shifting from high-fidelity (ADMIRE, blue) to iconographic (DIVA, pink) within each architectural family. view at source ↗
Figure 4
Figure 4. Iconographic Abstraction in Action (AdMIRe vs. DIVA). Top Row: The original high-fidelity images from ADMIRE, where high-frequency texture creates "semiotic noise." Bottom Row: Our corresponding DIVA icons. By systematically simplifying the images across the full semantic spectrum (from Literal to Idiomatic), DIVA provides a clean, structure-aware testbed for multimodal reasoning. view at source ↗
read the original abstract

Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings. We further propose Semantic Alignment Gap ($\Delta$), an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding. We additionally introduce a directional signed bias $b(t)$ to separately measure the direction and strength of literal preference. Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and increased visual fidelity is associated with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery. Our findings suggest that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in intended meaning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the DIVA benchmark, which generates paired schematic, sense-anchored visualizations for literal and idiomatic readings of noun compounds, to quantify the semiotic gap in VLMs. It defines the architecture-agnostic Semantic Alignment Gap metric Δ and the directional signed bias b(t), evaluates eight recent VLMs, and reports a consistent Literal Superiority Bias: literal preference persists across scales, and higher visual fidelity correlates with weaker symbolic alignment, implying interference from hyper-realistic imagery.

Significance. If the central findings hold after validation, the work provides a concrete, falsifiable method for measuring divergence between literal and idiomatic visual grounding in multimodal models. It supplies evidence that scale alone does not close the gap and points to iconographic abstraction as a potential remedy, which could inform future VLM training objectives focused on compositional and abstract reasoning.

major comments (1)
  1. [Section 3] Section 3 (DIVA benchmark construction): no human fidelity ratings, inter-annotator agreement scores, or controls comparing schematic vs. photorealistic renderings of the same noun-compound items are reported. Because Δ and b(t) are computed directly from model outputs on these visualizations, any systematic difference in how faithfully literal senses are depicted relative to idiomatic ones (due to generation prompts, base model, or post-selection) would artifactually inflate the Literal Superiority Bias, making the benchmark validity load-bearing for the main claim.
minor comments (2)
  1. [Abstract and Section 4] The abstract and Section 4 would benefit from explicit statement of the noun-compound sample size, number of visualizations per item, and any statistical controls for multiple comparisons when reporting the correlation between visual fidelity and symbolic alignment.
  2. [Section 4] Notation for b(t) should include a brief reminder of its sign convention (positive = literal preference) on first use in the results section to aid readability.
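The multiple-comparison control the first minor comment asks for could, for instance, be a Holm step-down correction over the per-model correlation tests. A sketch with illustrative p-values (not taken from the paper):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down procedure: reject hypotheses while the k-th
    smallest p-value stays below alpha / (m - k); once one fails,
    all larger p-values are kept as well."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # step-down: every larger p-value also fails
    return reject

# Illustrative p-values for four per-model correlation tests.
decisions = holm_bonferroni([0.001, 0.04, 0.03, 0.2])
# Only the smallest p-value survives correction in this example.
```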

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on the DIVA benchmark and its validation. We address the single major comment below and will incorporate the suggested controls in the revised manuscript.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (DIVA benchmark construction): no human fidelity ratings, inter-annotator agreement scores, or controls comparing schematic vs. photorealistic renderings of the same noun-compound items are reported. Because Δ and b(t) are computed directly from model outputs on these visualizations, any systematic difference in how faithfully literal senses are depicted relative to idiomatic ones (due to generation prompts, base model, or post-selection) would artifactually inflate the Literal Superiority Bias, making the benchmark validity load-bearing for the main claim.

    Authors: We acknowledge that the current manuscript does not report human fidelity ratings, inter-annotator agreement, or direct schematic-vs-photorealistic controls. The benchmark relies on sense-anchored prompts drawn from dictionary definitions and consistent generation procedures to align visualizations with intended literal or idiomatic readings, which reduces but does not eliminate the possibility of differential depiction fidelity. To address this directly, the revised version will add a human evaluation study in which annotators rate the semantic fidelity of both literal and idiomatic renderings on a standardized scale, report inter-annotator agreement, and include a control subset where the same noun-compound items are rendered in both schematic and photorealistic styles. Model outputs on these paired controls will be analyzed to quantify any contribution of visual style to the measured Δ and b(t). These additions will be presented in an expanded Section 3 and will not change the core architecture-agnostic metrics or the main experimental results. revision: yes
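The inter-annotator agreement the rebuttal commits to is conventionally reported as Cohen's kappa. A self-contained sketch with invented ratings (the real annotation data is not in this review):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items:
    observed agreement corrected for chance agreement."""
    n = len(a)
    labels = sorted(set(a) | set(b))
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Invented fidelity judgments ("ok" = rendering preserves the intended
# sense) from two annotators over six candidate images.
r1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
r2 = ["ok", "ok", "bad", "bad", "bad", "ok"]
kappa = cohens_kappa(r1, r2)
```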

Circularity Check

0 steps flagged

No significant circularity; metrics are direct computations from benchmark outputs.

full rationale

The paper defines Semantic Alignment Gap Δ and signed bias b(t) explicitly as functions of VLM outputs on the newly introduced DIVA benchmark pairs. These quantities are computed post-hoc from model responses rather than fitted to data or reduced to prior parameters by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core claims or metric definitions. The Literal Superiority Bias is reported as an empirical pattern observed across eight evaluated models, not derived algebraically from the benchmark construction itself. The derivation chain therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The claims rest on the unvalidated assumption that schematic icons cleanly separate literal and idiomatic senses and that the new metrics capture the intended semiotic gap; no external benchmarks or prior derivations are referenced.

axioms (1)
  • domain assumption Schematic iconicity can represent idiomatic meanings of noun compounds without loss or distortion relative to literal meanings
    Invoked in the construction of paired sense-anchored visualizations for DIVA.
invented entities (3)
  • DIVA benchmark no independent evidence
    purpose: Controlled testbed replacing photorealistic detail with schematic iconicity for literal vs idiomatic evaluation
    Newly introduced in this work.
  • Semantic Alignment Gap (Δ) no independent evidence
    purpose: Quantify divergence between literal and idiomatic visual grounding in an architecture-agnostic way
    Newly proposed metric.
  • directional signed bias b(t) no independent evidence
    purpose: Separately measure direction and strength of literal preference
    Newly introduced bias measure.

pith-pipeline@v0.9.0 · 5466 in / 1419 out tokens · 68043 ms · 2026-05-10T06:22:50.112092+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Isambard-AI: a leadership class supercomputer optimised specifically for Artificial Intelligence. Technical report, University of Bristol. Preslav I Nakov and Marti A Hearst. 2013. Semantic interpretation of noun compounds using verbal and other paraphrases. ACM Transactions on Speech and Language Processing (TSLP), 10(3):1–51. Yuwei Niu, Munan Ning, Men...

  2. [2]

    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280

    Valse: A task-independent benchmark for vision and language models centered on linguistic phenomena. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280. Thomas Pickard, Aline Villavicencio, Maggie Mi, Wei He, Dylan Phelps, and Marco Idiart. 2025. SemEval-2025 task 1: AdMIRe...

  3. [3]

    In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 335–345, Abu Dhabi, United Arab Emirates (Hybrid)

    DALLE-2 is seeing double: Flaws in word-to-concept mapping in Text2Image models. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 335–345, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and...

  4. [4]

    In Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 89–102, Suzhou, China

    HALLUCINOGEN: Benchmarking hallucination in implicit reasoning within large vision language models. In Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 89–102, Suzhou, China. Association for Computational Linguistics. Thomas Lloyd Short. 2007. Peirce's theory of signs. Cambridge University Press. Belkiss Souayed, Sara...

  5. [5]

    In Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025), pages 1–18, Suzhou, China

    Template-based text-to-image alignment for language accessibility: a study on visualizing text simplifications. In Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025), pages 1–18, Suzhou, China. Association for Computational Linguistics. Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, and Xihui Liu. 202...

  6. [6]

    person watching loud TV,

    Conceptual Instructions (De-Noising): • Identify the Core Essence: Determine the fundamental meaning or action of the image. Ignore specific details, individuals, or environments. • Abstract & Merge (Metonymy): If the image contains multiple elements forming a narrative, distill them into a single, unified glyph that represents the entire concept ...

  7. [7]

    Avoid organic or sketchy lines

    Stylistic Instructions (Flat Iconography): • Geometric Reconstruction: Rebuild the concept using only pure geometric primitives (perfect circles, squares, triangles, and clean, uniform arcs). Avoid organic or sketchy lines. • Strict Flat Design: There must be absolutely zero gradients, shadows, textures, or lighting effects. All colors must be solid...

  8. [8]

    the target noun compound text (e.g., Eye Candy)

  9. [9]

    the original high-fidelity image from ADMIRE, used as a semantic reference

  10. [10]

    Task instructions. Annotators were instructed to evaluate each candidate along two dimensions:

    four candidate iconographic images generated by our abstraction pipeline. Task instructions. Annotators were instructed to evaluate each candidate along two dimensions:

  11. [11]

    Semantic preservation: whether the candidate preserved the intended meaning of the noun compound

  12. [12]

    semiotic noise

    Stylistic conformity: whether the candidate satisfied the required iconographic constraints (flat design, geometric composition, limited palette, minimal background detail). Annotators selected the best candidate only if at least one image satisfied both criteria. This was not a forced-choice task: if all four candidates failed to preserve the intended mean...