More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage
Pith reviewed 2026-05-10 06:22 UTC · model grok-4.3
The pith
Vision-language models exhibit a persistent literal superiority bias that scale does not fix and that realistic imagery makes worse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLMs display a consistent Literal Superiority Bias when grounding noun compounds, favoring literal visual interpretations over idiomatic ones. The bias is quantified by the Semantic Alignment Gap (Δ), which tracks the divergence between the two readings, and by a signed directional bias score b(t). Evaluations reveal that larger model scale fails to reduce the preference, while higher visual fidelity correlates with weaker symbolic alignment, which the authors interpret as cognitive interference arising from hyper-realistic imagery. The work concludes that better compositional understanding requires iconographic abstraction of visual inputs together with explicit anchoring of interpretation and generation in intended meaning.
What carries the argument
DIVA benchmark that produces paired, sense-anchored schematic visualizations for literal and idiomatic readings, together with the architecture-agnostic Semantic Alignment Gap (Δ) and directional signed bias b(t) that separately quantify divergence magnitude and preference direction.
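The paper does not give the two metrics in closed form here; under the natural reading of the abstract, a minimal sketch of their computation (with hypothetical per-compound alignment scores `s_lit` and `s_idiom`, since the paper's actual scoring function is not reproduced in this review) might look like:

```python
# Hedged sketch of the DIVA metrics, assuming each noun compound t yields
# two model-image alignment scores: s_lit(t) for the literal rendering and
# s_idiom(t) for the idiomatic one. These formulas are an illustrative
# reconstruction, not the paper's exact definitions.

def signed_bias(s_lit: float, s_idiom: float) -> float:
    """Directional signed bias b(t): positive = literal preference."""
    return s_lit - s_idiom

def alignment_gap(s_lit: float, s_idiom: float) -> float:
    """Semantic Alignment Gap (Delta): magnitude of divergence only."""
    return abs(s_lit - s_idiom)

# A model showing Literal Superiority Bias would have mean b(t) > 0
# across compounds. Scores below are hypothetical.
scores = [(0.82, 0.55), (0.74, 0.61), (0.69, 0.70)]
mean_bias = sum(signed_bias(l, i) for l, i in scores) / len(scores)
print(round(mean_bias, 3))  # positive: literal preference on average
```

Keeping Δ unsigned and b(t) signed is what lets the benchmark separate how far the two readings diverge from which reading the model prefers.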
Load-bearing premise
That the schematic icons and sense-anchored pairs in DIVA cleanly isolate the semiotic gap without their own generation process or selection choices creating the observed literal preference.
What would settle it
A follow-up run on the same eight models that finds the literal preference vanishes or reverses when the DIVA schematic images replace photorealistic ones, or that finds no correlation between visual detail level and alignment scores.
Figures
Original abstract
Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings. We further propose Semantic Alignment Gap ($\Delta$), an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding. We additionally introduce a directional signed bias $b(t)$ to separately measure the direction and strength of literal preference. Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and increased visual fidelity is associated with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery. Our findings suggest that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in intended meaning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the DIVA benchmark, which generates paired schematic, sense-anchored visualizations for literal and idiomatic readings of noun compounds, to quantify the semiotic gap in VLMs. It defines the architecture-agnostic Semantic Alignment Gap metric Δ and the directional signed bias b(t), evaluates eight recent VLMs, and reports a consistent Literal Superiority Bias: literal preference persists across scales, and higher visual fidelity correlates with weaker symbolic alignment, implying interference from hyper-realistic imagery.
Significance. If the central findings hold after validation, the work provides a concrete, falsifiable method for measuring divergence between literal and idiomatic visual grounding in multimodal models. It supplies evidence that scale alone does not close the gap and points to iconographic abstraction as a potential remedy, which could inform future VLM training objectives focused on compositional and abstract reasoning.
major comments (1)
- [Section 3] Section 3 (DIVA benchmark construction): no human fidelity ratings, inter-annotator agreement scores, or controls comparing schematic vs. photorealistic renderings of the same noun-compound items are reported. Because Δ and b(t) are computed directly from model outputs on these visualizations, any systematic difference in how faithfully literal senses are depicted relative to idiomatic ones (due to generation prompts, base model, or post-selection) would artifactually inflate the Literal Superiority Bias, making the benchmark validity load-bearing for the main claim.
minor comments (2)
- [Abstract and Section 4] The abstract and Section 4 would benefit from explicit statement of the noun-compound sample size, number of visualizations per item, and any statistical controls for multiple comparisons when reporting the correlation between visual fidelity and symbolic alignment.
- [Section 4] Notation for b(t) should include a brief reminder of its sign convention (positive = literal preference) on first use in the results section to aid readability.
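One standard control the first minor comment asks for is a multiple-comparisons correction over the per-model fidelity-alignment correlation tests. A minimal Holm-Bonferroni sketch (the p-values are hypothetical placeholders for the eight per-model tests; the paper reports none here) is:

```python
# Holm-Bonferroni step-down correction, a standard control when testing
# one correlation per evaluated model. The p-values below are
# hypothetical placeholders, not results from the paper.

def holm_reject(pvals, alpha=0.05):
    """Return booleans: True where the null hypothesis is rejected."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down procedure: stop at first non-rejection
    return reject

# Eight hypothetical p-values, one per evaluated VLM.
pvals = [0.001, 0.20, 0.004, 0.03, 0.008, 0.002, 0.049, 0.015]
print(holm_reject(pvals))
```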
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the DIVA benchmark and its validation. We address the single major comment below and will incorporate the suggested controls in the revised manuscript.
Point-by-point responses
- Referee: [Section 3] Section 3 (DIVA benchmark construction): no human fidelity ratings, inter-annotator agreement scores, or controls comparing schematic vs. photorealistic renderings of the same noun-compound items are reported. Because Δ and b(t) are computed directly from model outputs on these visualizations, any systematic difference in how faithfully literal senses are depicted relative to idiomatic ones (due to generation prompts, base model, or post-selection) would artifactually inflate the Literal Superiority Bias, making the benchmark validity load-bearing for the main claim.
Authors: We acknowledge that the current manuscript does not report human fidelity ratings, inter-annotator agreement, or direct schematic-vs-photorealistic controls. The benchmark relies on sense-anchored prompts drawn from dictionary definitions and consistent generation procedures to align visualizations with intended literal or idiomatic readings, which reduces but does not eliminate the possibility of differential depiction fidelity. To address this directly, the revised version will add a human evaluation study in which annotators rate the semantic fidelity of both literal and idiomatic renderings on a standardized scale, report inter-annotator agreement, and include a control subset where the same noun-compound items are rendered in both schematic and photorealistic styles. Model outputs on these paired controls will be analyzed to quantify any contribution of visual style to the measured Δ and b(t). These additions will be presented in an expanded Section 3 and will not change the core architecture-agnostic metrics or the main experimental results.
revision: yes
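The control analysis the authors promise can be sketched as a paired comparison: each compound is rendered in both styles, and the per-item change in signed bias isolates the contribution of visual style. All names and numbers below are hypothetical illustrations of that design, not the authors' analysis:

```python
# Hedged sketch of the promised paired style control: for each noun
# compound we have signed bias b(t) measured on a schematic and on a
# photorealistic rendering of the same item. A paired mean difference
# near zero would suggest style contributes little to the measured
# literal preference; a large positive value would implicate it.

def paired_style_effect(bias_schematic, bias_photo):
    """Mean per-item difference (photo - schematic) in signed bias."""
    assert len(bias_schematic) == len(bias_photo)
    diffs = [p - s for s, p in zip(bias_schematic, bias_photo)]
    return sum(diffs) / len(diffs)

# Hypothetical per-item biases for four compounds.
b_schem = [0.05, 0.10, -0.02, 0.08]
b_photo = [0.21, 0.18, 0.09, 0.25]
effect = paired_style_effect(b_schem, b_photo)
print(round(effect, 3))  # positive: photorealism amplifies literal bias
```

Pairing the same items across styles is what makes the comparison a within-item control rather than a confounded between-set comparison.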
Circularity Check
No significant circularity; metrics are direct computations from benchmark outputs
Full rationale
The paper defines Semantic Alignment Gap Δ and signed bias b(t) explicitly as functions of VLM outputs on the newly introduced DIVA benchmark pairs. These quantities are computed post-hoc from model responses rather than fitted to data or reduced to prior parameters by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core claims or metric definitions. The Literal Superiority Bias is reported as an empirical pattern observed across eight evaluated models, not derived algebraically from the benchmark construction itself. The derivation chain therefore remains independent of its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Schematic iconicity can represent idiomatic meanings of noun compounds without loss or distortion relative to literal meanings.
invented entities (3)
- DIVA benchmark: no independent evidence
- Semantic Alignment Gap (Δ): no independent evidence
- directional signed bias b(t): no independent evidence
discussion (0)