ICLR
2 Pith papers cite this work.
verdicts
UNVERDICTED: 2
citing papers explorer
- Disparities In Negation Understanding Across Languages In Vision-Language Models
  VLMs exhibit affirmation bias that varies by language, with a new multilingual benchmark showing CLIP at or below chance on non-Latin scripts, MultiCLIP most uniform, and SpaceVLM corrections effective unevenly across typologies.
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
  VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.