Pith: machine review for the scientific record

arXiv: 2601.03331 · v2 · submitted 2026-01-06 · cs.CV · cs.AI · cs.LG

Recognition: unknown

MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

Authors on Pith: no claims yet
classification: cs.CV · cs.AI · cs.LG
keywords: models, reasoning, error, mmerror, multi-modal, vlms, answer, benchmark
Original abstract

Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 1997 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 12 representative VLMs, and even the best model, Gemini-3-Pro-Preview, classifies the error correctly in only 66.65% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal models. Project Page: https://mmerror-benchmark.github.io
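The evaluation protocol the abstract describes (each benchmark sample carries one ground-truth error type, and a model is scored on how often its predicted type matches) can be sketched as follows. This is a minimal illustration under assumed data fields, not the authors' code: the `Sample` structure, its field names, and the `classify_error` stub are hypothetical stand-ins for MMErroR's actual format.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    # Hypothetical fields: a reasoning trace containing one embedded
    # error, plus its ground-truth error-type label.
    reasoning: str
    error_type: str  # e.g. "calculation", "visual misreading"

def error_accuracy(samples, classify_error):
    """Fraction of samples whose predicted error type matches the label."""
    correct = sum(classify_error(s.reasoning) == s.error_type for s in samples)
    return correct / len(samples)

# Toy run with a trivial classifier that always predicts one type.
samples = [
    Sample("2 + 2 = 5, so ...", "calculation"),
    Sample("the sign reads STOP, so ...", "visual misreading"),
    Sample("3 * 3 = 10, hence ...", "calculation"),
]
print(error_accuracy(samples, lambda r: "calculation"))  # → 0.6666666666666666
```

Under this metric, the reported 66.65% for Gemini-3-Pro-Preview would mean roughly two of every three samples receive the correct error-type label.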

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing

cs.CV · 2026-04 · unverdicted · novelty 7.0

    TransSplat uses unbalanced semantic transport to match edited 2D evidence with 3D Gaussians and recover a shared 3D edit field, yielding better local accuracy and structural consistency than prior view-consistency methods.

  2. HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation

cs.LG · 2026-04 · unverdicted · novelty 5.0

    HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.

  3. CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

cs.CV · 2026-04 · unverdicted · novelty 4.0

    CoCo-SAM3 improves SAM3 by aligning evidence from synonymous prompts for concept consistency and then running inter-class competition on a unified scale to reduce mask overlaps.