
hub Baseline reference

MMBench: Is Your Multi-modal Model an All-around Player?

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

83 Pith papers citing it
Baseline 55% of classified citations
abstract

Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of the following key features: 1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing similar benchmarks in terms of the number and variety of evaluation questions and abilities; 2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which helps to yield accurate evaluation results for models with limited instruction-following capabilities; 3. MMBench incorporates multiple-choice questions in both English and Chinese versions, enabling an apples-to-apples comparison of VLMs' performance under a bilingual context. To summarize, MMBench is a systematically designed objective benchmark for a robust and holistic evaluation of vision-language models. We hope MMBench will assist the research community in better evaluating their models and facilitate future progress in this area. The evaluation code of MMBench has been integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.
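The abstract names CircularEval without spelling out the procedure. As a rough illustration, the sketch below assumes the idea as described in the MMBench paper: each multiple-choice question is posed once per circular shift of its options, and it counts as correct only if the model picks the right option on every pass. The names `circular_shifts`, `circular_eval`, and the `predict` callback are hypothetical and are not taken from VLMEvalKit; the LLM-based step that maps free-form output to a choice label is left out, so `predict` is assumed to return a letter directly. Option texts are assumed to be unique.

```python
from typing import Callable, List, Sequence

LABELS = "ABCDEFGH"  # option letters; assumes at most 8 choices


def circular_shifts(choices: Sequence[str]) -> List[List[str]]:
    """Return every circular rotation of the option list."""
    n = len(choices)
    return [list(choices[k:]) + list(choices[:k]) for k in range(n)]


def circular_eval(
    question: str,
    choices: Sequence[str],
    answer_idx: int,
    predict: Callable[[str, List[str]], str],
) -> bool:
    """Pass only if `predict` selects the correct option under every rotation.

    `predict` is a hypothetical stand-in for the model: it receives the
    question and the (rotated) options and returns a letter such as "A".
    """
    correct_text = choices[answer_idx]
    for rotated in circular_shifts(choices):
        # The correct letter moves as the options rotate.
        expected = LABELS[rotated.index(correct_text)]
        if predict(question, rotated) != expected:
            return False
    return True


if __name__ == "__main__":
    options = ["Berlin", "Paris", "Rome"]
    # A toy model that always answers "B" is right only on the unshifted pass.
    print(circular_eval("Capital of France?", options, 1, lambda q, c: "B"))
    # A toy model that tracks the option text passes every rotation.
    print(circular_eval("Capital of France?", options, 1,
                        lambda q, c: LABELS[c.index("Paris")]))
```

Requiring agreement across all rotations penalizes a model that favors a fixed letter position, which is one way a plain single-pass accuracy score can be inflated.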

citation-role summary

dataset 16 background 12 baseline 2 other 1

claims ledger

  • dataset Table 2. Comparison with SoTA models on 16 multimodal benchmarks. OCR-related benchmarks include: DocVQA test [82], ChartQA test [81], InfographicVQA test [83], TextVQA val [100], and OCRBench [67]. General multimodal benchmarks encompass: MME [26], RealWorldQA [125], AI2D test [39], MMMU val [135], MMBench-EN/CN test [66], CCBench dev [66], MMVet [133], SEED Image [46], and HallusionBench [30]. Additionally, the math dataset includes MathVista testmini [75]. *
  • dataset MME [68]: MME is the first comprehensive evaluation benchmark designed for MLLMs. It assesses models' perception and cognitive abilities across 14 subtasks, including object presence, counting, position, color recognition, as well as commonsense reasoning, numerical computation, text translation, and code reasoning. We report the overall score across all tasks. MMBench [156]: MMBench evaluates the multimodal understanding of MLLMs through nearly 3,000 multiple-choice questions spanning 20 dimen
  • dataset With 7B parameters, ShareGPT4V-7B outperforms competitors in 9 out of 11 benchmarks and ranks second on the others, despite these competitors using larger training datasets or more parameters. Benchmark names are abbreviated due to space limits. LLaVA^W: LLaVA-Bench (In-the-Wild) [31]; MME^P: MME Perception [15]; MME^C: MME Cognition [15]; MMB: MMBenchmark [33]; MMB^CN: MMBench-Chinese [33]; SEED^I: SEED-Bench (Image) [24]; MM-Vet [58]; QBench [55]; SQA^I: ScienceQA-IMG [34]; VQA^V2 [17]; VizW
  • dataset soning and even outperforms later approaches like InstructBLIP [14] on such benchmarks [55], while InstructBLIP excels in traditional VQA benchmarks that demands single-word or short answers. Given the significant differences in the model architecture and training data between them, the root cause of the disparity in their capabilities remains elusive, despite conjectures [37, 55]: the amount of training data, the usage of resamplers like Qformer [32], etc. To this arXiv:2310.03744v2 [cs.CV]
  • dataset 1 Evaluation Setting. Benchmarks. To comprehensively assess the capabilities of our models, we conduct evaluations across 42 public benchmarks, covering eight distinct categories: General VQA, STEM, OCR & Document, Visual Grounding, Spatial Reasoning, GUI Agents, Coding, and Video Understanding. The following benchmarks are used for evaluation: • General VQA: MMBench-V1.1 [30], MMStar [7], BLINK(val) [11], MUIRBENCH [53], ZeroBench(val) [39], HallusionBench [15], GeoBench [2]; • STEM: MMMU(val) [67]
  • dataset comparison results of human evaluation scores based on English prompts. For the vision-language understanding task, we assess the average scores across twelve benchmarks: SEEDBench-Img [45], OCRBench [59] (with normalized results), MMVet [98], POPE [51], VQAv2 [27], GQA [34], TextVQA [78], ChartQA [61], AI2D [36], RealWorldQA [91], MMMU [99], and MMbench [58]. For the video generation task, we present comparison results of VBench. 1 Introduction Next-token prediction has revolutionized the f

representative citing papers

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

cs.CV · 2024-06-24 · unverdicted · novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
