Automated benchmark generation from domain guidelines informed by Bloom’s taxonomy

Evaluating large language models trained on code · 2026 · arXiv 2601.20253

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

BloomBench reveals that state-of-the-art VLMs perform well on semantic understanding but struggle with factual recall and creative synthesis, while also showing large English-Arabic performance gaps.

CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

cs.SE · 2026-04-07 · unverdicted · novelty 6.0

CAKE benchmark shows MCQ accuracy on cloud architecture plateaus near 99% above 3B parameters while free-response scores improve steadily with size, and reasoning steps help but tools hurt small models.

citing papers explorer

Showing 2 of 2 citing papers.

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models
cs.CV · 2026-06-04 · unverdicted · none · ref 2
CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
cs.SE · 2026-04-07 · unverdicted · none · ref 22
