HERM: Benchmarking and enhancing multimodal LLMs for human-centric understanding
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2
verdicts
UNVERDICTED 2
representative citing papers
Vision-language models enable zero-shot face image quality assessment whose biometric utility depends on model architecture rather than size, with outputs that align with traditional methods but vary by prompt.
citing papers explorer
- HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
  HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
- Employing Vision-Language Models for Face Image Quality Assessment
  Vision-language models enable zero-shot face image quality assessment whose biometric utility depends on model architecture rather than size, with outputs that align with traditional methods but vary by prompt.