Iso-bench: Benchmarking multimodal causal reasoning in visual-language models through procedural plans
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.CV (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
T2I models frequently exhibit semantic errors, logical inconsistencies, and incorrect reasoning steps in visual text generation tasks, unlike text-only models.
citing papers explorer
- Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
  VLMs exhibit only slight performance degradation on hallucination benchmarks when substantial image tokens are removed, with layer-wise analysis showing increased visual token similarity in deeper layers, suggesting current benchmarks inadequately test fine-grained visual grounding.
- Evaluating Reasoning Fidelity in Visual Text Generation
  T2I models frequently exhibit semantic errors, logical inconsistencies, and incorrect reasoning steps in visual text generation tasks, unlike text-only models.