Visual representations inside the language model. arXiv preprint arXiv:2510.04819.
4 Pith papers cite this work.
fields: cs.CV
years: 2026
4 representative citing papers
citing papers explorer
- Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
  Unified multimodal models are only pseudo-unified: entropy probing shows that information is encoded asymmetrically across modalities, and response patterns split between text and image generation. See the entropy-probing sketch after this list.
- Do multimodal models imagine electric sheep?
  Fine-tuning VLMs to output action sequences for puzzles induces emergent internal visual representations, and integrating these representations into reasoning improves performance. See the linear-probe sketch after this list.
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs hallucinate because visual embeddings are geometrically over-aligned with the text manifold in a universal, dataset-agnostic subspace; projecting out this linguistic bias mitigates the hallucinations. See the projection sketch after this list.
- VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
  VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis. See the Logit Lens sketch after this list.
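Entropy probing, as in the pseudo-unification paper, compares the model's predictive uncertainty across modalities. A minimal PyTorch sketch, assuming a unified model that emits discrete image tokens interleaved with text tokens and that the two streams can be separated by a mask; the function names and masking setup here are illustrative assumptions, not the paper's actual protocol:

```python
# Sketch: per-token predictive entropy, split by output modality.
# Assumes `logits` from a unified multimodal decoder and a boolean
# mask marking image-token positions; both are hypothetical inputs.
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (nats) of the next-token distribution at each position.

    logits: (seq_len, vocab_size)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)  # (seq_len,)

def modality_entropy_gap(logits: torch.Tensor, image_token_mask: torch.Tensor):
    """Mean predictive entropy on image-token positions vs. text positions.

    image_token_mask: (seq_len,) bool, True where the model emits image
    tokens (e.g., discrete VQ codes). A large gap between the two means is
    the kind of modality-asymmetric encoding such probing looks for.
    """
    ent = token_entropies(logits)
    return ent[image_token_mask].mean().item(), ent[~image_token_mask].mean().item()
```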
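For the emergent-representation result, the standard way to test whether internal activations encode visual state is a linear probe. A minimal sketch, assuming hidden states have already been extracted from the fine-tuned VLM and paired with ground-truth puzzle properties; all names and the probe setup are hypothetical, not the paper's method:

```python
# Sketch: linear probe for visual information in hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_hidden_states(hidden: np.ndarray, labels: np.ndarray) -> float:
    """hidden: (n_examples, d_model) activations at some layer/position;
    labels: (n_examples,) a discrete property of the visual state
    (e.g., a puzzle cell's occupancy). Returns held-out probe accuracy;
    above-chance accuracy is evidence the representation is present."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```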
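The geometric-debiasing idea reduces to a subspace projection: estimate a low-rank "linguistic" subspace and remove its component from each visual embedding, v' = v - U Uᵀ v. A minimal sketch, assuming PCA directions of the text embeddings stand in for the paper's bias subspace and that the rank k is hand-picked; both are assumptions for illustration:

```python
# Sketch: project visual embeddings off an estimated linguistic subspace.
import torch

def linguistic_subspace(text_embs: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Top-k principal directions of the text embedding cloud.

    text_embs: (n_text, d). Returns U: (d, k) with orthonormal columns.
    """
    centered = text_embs - text_embs.mean(dim=0)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k].T

def debias(visual_embs: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Orthogonal-complement projection v' = v - U U^T v, removing the
    component of each visual embedding that lies in span(U)."""
    return visual_embs - (visual_embs @ U) @ U.T
```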
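Logit Lens analysis decodes intermediate hidden states through the model's own final norm and unembedding, revealing which token each layer already favors. A minimal, self-contained sketch using GPT-2 in Hugging Face transformers; the cited paper applies the lens to VLMs, and GPT-2 is used here only because its module names are well known:

```python
# Sketch: Logit Lens on GPT-2's residual stream at the last position.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Apply the final LayerNorm, then the unembedding, to the last position.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, tok.decode(logits.argmax(-1)))
```

Running this prints the top predicted token per layer; the VLM analysis in the cited paper uses the same trick to show semantic labels surfacing in place of visual detail.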