Automated benchmark generation from domain guidelines informed by Bloom’s taxonomy
2 Pith papers cite this work.
citing papers explorer
- Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models
  BloomBench reveals that state-of-the-art VLMs perform well on semantic understanding but struggle with factual recall and creative synthesis, while also showing large English-Arabic performance gaps.
- CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
  CAKE benchmark shows MCQ accuracy on cloud architecture plateaus near 99% above 3B parameters while free-response scores improve steadily with size, and reasoning steps help but tools hurt small models.