MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pith reviewed 2026-05-11 01:25 UTC · model grok-4.3
The pith
The MathVista benchmark shows the best-performing foundation model, GPT-4V, reaching only 49.9 percent accuracy on visual mathematical reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathVista is a benchmark of 6,141 examples that merges challenges from 28 prior multimodal datasets with three newly created ones (IQTest, FunctionQA, PaperQA) to test fine-grained visual understanding together with compositional mathematical reasoning. Evaluation of twelve foundation models shows GPT-4V attaining 49.9 percent overall accuracy, outperforming Bard by 15.1 points, while still trailing humans by 10.4 points because it struggles to parse intricate figures and to carry out rigorous step-by-step reasoning.
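As a quick arithmetic check, the two reported margins fix the implied scores of the other reference points, treating the gaps as absolute percentage points (which matches the ~34.8 percent Bard figure cited in the referee report below); a minimal sketch:

```python
# Implied accuracies from the reported margins, treating the gaps as
# absolute percentage points (consistent with the referee's ~34.8% Bard figure).
gpt4v = 49.9            # best-performing model, overall accuracy (%)
bard = gpt4v - 15.1     # second-best performer: 49.9 - 15.1 = 34.8%
human = gpt4v + 10.4    # human baseline:        49.9 + 10.4 = 60.3%

print(f"Bard  ~ {bard:.1f}%")   # 34.8%
print(f"Human ~ {human:.1f}%")  # 60.3%
```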
What carries the argument
The MathVista benchmark, which unifies existing and new visual-math tasks to quantify combined perception and reasoning in foundation models.
If this is right
- GPT-4V's lead stems mainly from stronger visual perception and mathematical reasoning than other models.
- Self-verification and self-consistency methods can raise accuracy on these tasks (a minimal self-consistency sketch follows this list).
- Interactive chatbot use of GPT-4V shows promise for handling visually rich math problems.
- The remaining gap to human performance highlights the need for better joint vision-and-reasoning capabilities in future models.
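Self-consistency is only named above, not specified; the underlying idea is to sample several independent reasoning chains and keep the majority final answer. A minimal, generic sketch, assuming `ask_model` is any stochastic model call that returns a final answer string (a placeholder, not MathVista's own evaluation harness):

```python
from collections import Counter

def self_consistent_answer(ask_model, question, n_samples=5):
    """Generic self-consistency: sample several reasoning chains and keep
    the most frequent final answer.

    `ask_model(question)` is a placeholder for any stochastic model call
    returning an answer string; it is not MathVista's evaluation pipeline.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer
```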
Where Pith is reading between the lines
- Benchmarks like this may accelerate targeted training data that pairs diagrams with explicit reasoning chains.
- Models passing MathVista could still fail on dynamic scenes or real-time visual math not captured in static images.
- The performance ordering among models may shift if new architectures emphasize diagram parsing before symbolic manipulation.
Load-bearing premise
The 6,141 selected examples form a representative sample of the mathematical reasoning problems that appear in visual settings.
What would settle it
A collection of visual mathematics problems, drawn independently of the 28 source datasets, on which GPT-4V scores near human level even as it scores only about 50 percent on MathVista, would undermine the benchmark's claimed representativeness.
read the original abstract
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MathVista, a benchmark for mathematical reasoning in visual contexts that aggregates 6,141 examples from 28 existing multimodal datasets plus three newly authored ones (IQTest, FunctionQA, PaperQA). It reports a comprehensive evaluation of 12 foundation models, with GPT-4V achieving the highest overall accuracy of 49.9% (15.1 points above Bard) while remaining 10.4 points below human performance. The work further examines GPT-4V's self-verification, self-consistency, and interactive capabilities, and releases the benchmark publicly.
Significance. If the benchmark proves representative, MathVista supplies a much-needed quantitative yardstick for multimodal models on tasks that combine fine-grained visual perception with compositional mathematical reasoning. The public release of the 6,141-example suite, together with the systematic comparison of 12 models, provides a reproducible baseline that can guide future model development. The observed gap between GPT-4V and both weaker models and humans underscores the value of such targeted benchmarks.
major comments (2)
- [§3] §3 (Dataset Construction): The paper supplies only high-level descriptions of the three newly created datasets (IQTest, FunctionQA, PaperQA). No quantitative diversity statistics, pre-registered inclusion criteria, inter-annotator agreement figures, or explicit filtering rules are reported. Because the headline result (GPT-4V at 49.9 % vs. Bard at ~34.8 %) rests on the claim that these 6,141 examples constitute an unbiased sample of visual mathematical reasoning, the absence of these details is load-bearing for interpreting the 15.1-point margin.
- [§4.2] §4.2 (Evaluation Protocol): The scoring procedure for open-ended and free-form answers is not fully specified (e.g., exact matching rules, use of LLM judges, handling of partial credit). Ambiguities here could systematically affect accuracy numbers across models and thereby the reported performance gaps.
minor comments (2)
- [Table 1] Table 1 and Figure 2: The breakdown of example counts and task categories across the 28 existing datasets plus the three new ones would benefit from an explicit column showing the proportion of each source that survived filtering.
- [Abstract] Abstract: The sentence stating that GPT-4V 'falls short of human performance by 10.4%' should be accompanied by a brief parenthetical on how human accuracy was measured (number of annotators, time limits, etc.).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We appreciate the emphasis on improving the transparency of dataset construction and evaluation details, which will strengthen the paper. We address each major comment below and describe the planned revisions.
read point-by-point responses
-
Referee: [§3] §3 (Dataset Construction): The paper supplies only high-level descriptions of the three newly created datasets (IQTest, FunctionQA, PaperQA). No quantitative diversity statistics, pre-registered inclusion criteria, inter-annotator agreement figures, or explicit filtering rules are reported. Because the headline result (GPT-4V at 49.9 % vs. Bard at ~34.8 %) rests on the claim that these 6,141 examples constitute an unbiased sample of visual mathematical reasoning, the absence of these details is load-bearing for interpreting the 15.1-point margin.
Authors: We acknowledge that Section 3 currently offers primarily high-level descriptions of IQTest, FunctionQA, and PaperQA. In the revised manuscript we will expand this section with quantitative diversity statistics (e.g., topic distributions across algebra, geometry, and statistics; counts of visual complexity levels) and explicit statements of the inclusion criteria and filtering rules applied during curation. Because these three datasets were created internally by the author team without a pre-registered protocol or multiple independent annotators, inter-annotator agreement figures are not available and cannot be supplied. We will add a brief limitations paragraph clarifying this point and its implications for claims of representativeness. revision: partial
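As an illustration of the kind of diversity statistics the rebuttal promises, a minimal sketch that tabulates topic and visual-context shares over a curated example set; the `topic` and `context` field names are assumed for illustration and are not MathVista's actual metadata schema:

```python
from collections import Counter

def diversity_stats(examples):
    """Tabulate topic and visual-context shares over curated examples.

    `examples` is assumed to be a list of dicts with 'topic' and 'context'
    keys; these field names are illustrative, not MathVista's real schema.
    """
    total = len(examples)
    topics = Counter(ex["topic"] for ex in examples)
    contexts = Counter(ex["context"] for ex in examples)
    return {
        "topic_share": {t: n / total for t, n in topics.items()},
        "context_share": {c: n / total for c, n in contexts.items()},
    }
```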
-
Referee: [§4.2] §4.2 (Evaluation Protocol): The scoring procedure for open-ended and free-form answers is not fully specified (e.g., exact matching rules, use of LLM judges, handling of partial credit). Ambiguities here could systematically affect accuracy numbers across models and thereby the reported performance gaps.
Authors: We agree that the current description of the scoring protocol in §4.2 leaves room for ambiguity. In the revised version we will provide a detailed specification of the exact matching rules (including normalization steps for units, formatting, and synonyms), confirm that no LLM judges were used for the main results (scoring relied on rule-based matching supplemented by human verification for edge cases), and clarify the policy on partial credit and multiple valid answers. These additions will improve reproducibility and allow readers to better assess the reported gaps, such as the 15.1-point margin between GPT-4V and Bard. revision: yes
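The exact rules are deferred to the revision; a rough sketch of what a rule-based matcher of the kind described could look like (case and comma normalization, trailing-unit stripping, numeric tolerance). The unit list and tolerance below are illustrative assumptions, not the benchmark's published protocol:

```python
import re

# Illustrative trailing units; the benchmark's actual normalization rules
# are to be specified in the revised paper, not reproduced here.
UNIT_SUFFIX = re.compile(r"\s*(cm|mm|km|kg|degrees|deg|°|%)\s*$")

def normalize(answer: str) -> str:
    """Lowercase, drop commas, collapse whitespace, strip a trailing unit."""
    s = answer.strip().lower().replace(",", "")
    s = re.sub(r"\s+", " ", s)
    return UNIT_SUFFIX.sub("", s).strip()

def is_correct(pred: str, gold: str, tol: float = 1e-2) -> bool:
    """Numeric comparison with a small tolerance when both answers parse
    as numbers; otherwise exact string match after normalization."""
    p, g = normalize(pred), normalize(gold)
    try:
        return abs(float(p) - float(g)) <= tol
    except ValueError:
        return p == g
```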
Circularity Check
No circularity: direct empirical evaluation on assembled benchmark
full rationale
The paper constructs MathVista by merging 28 existing multimodal math datasets with three newly authored ones (IQTest, FunctionQA, PaperQA) and then reports model accuracies obtained by direct, independent inference on the resulting 6,141 examples. No equations, fitted parameters, or first-principles derivations appear; the headline numbers (GPT-4V at 49.9 %) are raw test-set measurements, not quantities that reduce to the curation process by construction. Self-citations to prior datasets are ordinary sourcing and do not bear the load of any claimed prediction or uniqueness theorem. The evaluation chain is therefore self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning.
Forward citations
Cited by 60 Pith papers
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration
PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines o...
-
Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
-
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication, with performance predicted by an arithmetic load metric C and degradation confirmed as computational rather than pe...
-
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
-
Boosting Visual Instruction Tuning with Self-Supervised Guidance
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Too long; didn't solve
Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.
-
Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpr...
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
Reference graph
Works this paper leans on
-
[1]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
-
[2]
InstructBLIP: Towards general-purpose vision-language models with instruction tuning
-
[3]
VizWiz grand challenge: Answering visual questions from blind people
-
[4]
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models
-
[5]
Multimodal foundation models: From specialists to general-purpose assistants
-
[6]
Visual instruction tuning