Visual representations inside the language model. arXiv preprint arXiv:2510.04819.
4 Pith papers cite this work.
fields: cs.CV
years: 2026
4 representative citing papers
citing papers explorer
- Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
  Unified multimodal models are only pseudo-unified: entropy probing shows that information is encoded asymmetrically across modalities, and response patterns split between text and image generation. See the entropy-probing sketch after this list.
- Do multimodal models imagine electric sheep?
  Fine-tuning VLMs to output action sequences for puzzles induces emergent internal visual representations, and integrating these representations into reasoning improves performance. See the linear-probe sketch after this list.
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs hallucinate because visual embeddings are geometrically over-aligned with the text manifold in a universal, dataset-agnostic subspace; projecting out this linguistic bias mitigates the hallucinations. See the projection sketch after this list.
- VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
  VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis. See the Logit Lens sketch after this list.
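Entropy probing, as in the pseudo-unification paper, compares the model's predictive uncertainty across modalities. A minimal PyTorch sketch, assuming a unified model that emits discrete image tokens interleaved with text tokens and that the two streams can be separated by a mask; the function names and masking setup here are illustrative assumptions, not the paper's actual protocol:

```python
# Sketch: per-token predictive entropy, split by output modality.
# Assumes `logits` from a unified multimodal decoder and a boolean
# mask marking image-token positions; both are hypothetical inputs.
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (nats) of the next-token distribution at each position.

    logits: (seq_len, vocab_size)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)  # (seq_len,)

def modality_entropy_gap(logits: torch.Tensor, image_token_mask: torch.Tensor):
    """Mean predictive entropy on image-token positions vs. text positions.

    image_token_mask: (seq_len,) bool, True where the model emits image
    tokens (e.g., discrete VQ codes). A large gap between the two means is
    the kind of modality-asymmetric encoding such probing looks for.
    """
    ent = token_entropies(logits)
    return ent[image_token_mask].mean().item(), ent[~image_token_mask].mean().item()
```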
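For the emergent-representation result, the standard way to test whether internal activations encode visual state is a linear probe. A minimal sketch, assuming hidden states have already been extracted from the fine-tuned VLM and paired with ground-truth puzzle properties; all names and the probe setup are hypothetical, not the paper's method:

```python
# Sketch: linear probe for visual information in hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_hidden_states(hidden: np.ndarray, labels: np.ndarray) -> float:
    """hidden: (n_examples, d_model) activations at some layer/position;
    labels: (n_examples,) a discrete property of the visual state
    (e.g., a puzzle cell's occupancy). Returns held-out probe accuracy;
    above-chance accuracy is evidence the representation is present."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```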
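The geometric-debiasing idea reduces to a subspace projection: estimate a low-rank "linguistic" subspace and remove its component from each visual embedding, v' = v - U Uᵀ v. A minimal sketch, assuming PCA directions of the text embeddings stand in for the paper's bias subspace and that the rank k is hand-picked; both are assumptions for illustration:

```python
# Sketch: project visual embeddings off an estimated linguistic subspace.
import torch

def linguistic_subspace(text_embs: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Top-k principal directions of the text embedding cloud.

    text_embs: (n_text, d). Returns U: (d, k) with orthonormal columns.
    """
    centered = text_embs - text_embs.mean(dim=0)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k].T

def debias(visual_embs: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Orthogonal-complement projection v' = v - U U^T v, removing the
    component of each visual embedding that lies in span(U)."""
    return visual_embs - (visual_embs @ U) @ U.T
```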
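Logit Lens analysis decodes intermediate hidden states through the model's own final norm and unembedding, revealing which token each layer already favors. A minimal, self-contained sketch using GPT-2 in Hugging Face transformers; the cited paper applies the lens to VLMs, and GPT-2 is used here only because its module names are well known:

```python
# Sketch: Logit Lens on GPT-2's residual stream at the last position.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Apply the final LayerNorm, then the unembedding, to the last position.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, tok.decode(logits.argmax(-1)))
```

Running this prints the top predicted token per layer; the VLM analysis in the cited paper uses the same trick to show semantic labels surfacing in place of visual detail.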