M³Eval is a new cognitively-grounded benchmark that evaluates memory dimensions in multi-modal video models and reports consistent model weaknesses in disentanglement, interference, spatial-temporal grounding, and symbolic recall.
X-LeBench: A benchmark for extremely long egocentric video understanding. arXiv preprint arXiv:2501.06835, 2025.
3 Pith papers cite this work.
Citing papers by year: 2026 (3). Verdicts: unverdicted (3).
Representative citing papers:
Introduces the V-RAGBench benchmark and the CARVE method, which selects per-chunk retrieval configurations via parallel retrievers and adaptive reranking and outperforms eight VideoRAG baselines.
TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation
TeachObs is a new human-validated benchmark dataset and evaluation protocol for multimodal AI on classroom teaching observation, showing no model dominates across tracks and that models over-rate procedurally clear lessons.