SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Baseline reference. 67% of citing Pith papers use this work as a benchmark or comparison.
abstract
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models.
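The scoring protocol described in the abstract can be pictured with a short sketch: an LLM judge scores each open-ended answer, and the per-sample scores are averaged per capability and per capability combination. The Python below is a minimal illustration, not the official MM-Vet implementation; the `llm` callable, the prompt wording, the 0-to-1 score scale, and the sample record layout are all assumptions made for illustration.

```python
# Illustrative sketch of an LLM-judged, capability-aggregated evaluation.
# Not the official MM-Vet code; prompt text, score scale, and field names are assumptions.
from collections import defaultdict
from statistics import mean

def llm_judge_score(question, gold_answer, model_answer, llm) -> float:
    """Ask an LLM evaluator (any text-in/text-out callable) for a correctness score in [0, 1]."""
    prompt = (
        "Rate how well the model answer matches the ground truth, "
        "from 0.0 (wrong) to 1.0 (fully correct). Reply with a number only.\n"
        f"Question: {question}\n"
        f"Ground truth: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Score:"
    )
    return float(llm(prompt).strip())

def aggregate(samples, llm):
    """samples: dicts with 'question', 'answer', 'prediction', and a set of
    'capabilities' the item exercises (e.g. {'ocr', 'math'})."""
    per_capability = defaultdict(list)
    per_integration = defaultdict(list)
    overall = []
    for s in samples:
        score = llm_judge_score(s["question"], s["answer"], s["prediction"], llm)
        overall.append(score)
        for cap in s["capabilities"]:
            per_capability[cap].append(score)
        # A capability combination ("integration") is keyed by the full set of capabilities.
        per_integration[frozenset(s["capabilities"])].append(score)
    return {
        "overall": 100 * mean(overall),
        "capabilities": {c: 100 * mean(v) for c, v in per_capability.items()},
        "integrations": {"+".join(sorted(k)): 100 * mean(v) for k, v in per_integration.items()},
    }
```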
representative citing papers
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
MirrorBench reveals that leading MLLMs perform far below humans on tasks requiring self-referential perception and representation, even at the simplest level.
SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while reducing required training compute by up to 87x.
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
Attention sharpness barely predicts VLM correctness while hidden-state probes and self-consistency strongly do, with late-fusion models showing fragile reliability bottlenecks unlike early-fusion ones.
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal performance.
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40th to 1/10th of the tokens and supports streaming via a detachable KV-cache.
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
citing papers explorer
- DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
- Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.