Mdk12-bench: a multi-discipline benchmark for evaluating reasoning in multimodal large language models. arXiv preprint arXiv:2504.05782, 2025.
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: unverdicted (2)
citing papers explorer
- LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
  LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
- TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
  TeachObs is a new human-validated benchmark dataset and evaluation protocol for multimodal AI on classroom teaching observation, showing no model dominates across tracks and that models over-rate procedurally clear lessons.