MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pith reviewed 2026-05-11 01:25 UTC · model grok-4.3
The pith
The MathVista benchmark shows the best-performing foundation model, GPT-4V, reaching only 49.9 percent accuracy on visual mathematical reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathVista is a benchmark of 6,141 examples that merges challenges from 28 prior multimodal datasets with three newly created ones (IQTest, FunctionQA, PaperQA) to test fine-grained visual understanding together with compositional mathematical reasoning. Evaluation of twelve foundation models shows GPT-4V attaining 49.9 percent overall accuracy, outperforming Bard by 15.1 points, while still trailing humans by 10.4 points because it struggles to parse intricate figures and to carry out rigorous step-by-step reasoning.
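As a quick arithmetic check, the two reported margins fix the implied scores of the other reference points, treating the gaps as absolute percentage points (which matches the ~34.8 percent Bard figure cited in the referee report below); a minimal sketch:

```python
# Implied accuracies from the reported margins, treating the gaps as
# absolute percentage points (consistent with the referee's ~34.8% Bard figure).
gpt4v = 49.9            # best-performing model, overall accuracy (%)
bard = gpt4v - 15.1     # second-best performer: 49.9 - 15.1 = 34.8%
human = gpt4v + 10.4    # human baseline:        49.9 + 10.4 = 60.3%

print(f"Bard  ~ {bard:.1f}%")   # 34.8%
print(f"Human ~ {human:.1f}%")  # 60.3%
```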
What carries the argument
The MathVista benchmark, which unifies existing and new visual-math tasks to quantify combined perception and reasoning in foundation models.
If this is right
- GPT-4V's lead stems mainly from stronger visual perception and mathematical reasoning than other models.
- Self-verification and self-consistency methods can raise accuracy on these tasks (a minimal self-consistency sketch follows this list).
- Interactive chatbot use of GPT-4V shows promise for handling visually rich math problems.
- The remaining gap to human performance highlights the need for better joint vision-and-reasoning capabilities in future models.
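Self-consistency is only named above, not specified; the underlying idea is to sample several independent reasoning chains and keep the majority final answer. A minimal, generic sketch, assuming `ask_model` is any stochastic model call that returns a final answer string (a placeholder, not MathVista's own evaluation harness):

```python
from collections import Counter

def self_consistent_answer(ask_model, question, n_samples=5):
    """Generic self-consistency: sample several reasoning chains and keep
    the most frequent final answer.

    `ask_model(question)` is a placeholder for any stochastic model call
    returning an answer string; it is not MathVista's evaluation pipeline.
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return majority_answer
```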
Where Pith is reading between the lines
- Benchmarks like this may accelerate targeted training data that pairs diagrams with explicit reasoning chains.
- Models passing MathVista could still fail on dynamic scenes or real-time visual math not captured in static images.
- The performance ordering among models may shift if new architectures emphasize diagram parsing before symbolic manipulation.
Load-bearing premise
The 6,141 selected examples form a representative sample of the mathematical reasoning problems that appear in visual settings.
What would settle it
A collection of visual mathematics problems, drawn independently of the 28 source datasets, on which GPT-4V scores near human level even as it scores only about 50 percent on MathVista, would undermine the benchmark's claimed representativeness.
read the original abstract
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MathVista, a benchmark for mathematical reasoning in visual contexts that aggregates 6,141 examples from 28 existing multimodal datasets plus three newly authored ones (IQTest, FunctionQA, PaperQA). It reports a comprehensive evaluation of 12 foundation models, with GPT-4V achieving the highest overall accuracy of 49.9% (15.1 points above Bard) while remaining 10.4 points below human performance. The work further examines GPT-4V's self-verification, self-consistency, and interactive capabilities, and releases the benchmark publicly.
Significance. If the benchmark proves representative, MathVista supplies a much-needed quantitative yardstick for multimodal models on tasks that combine fine-grained visual perception with compositional mathematical reasoning. The public release of the 6,141-example suite, together with the systematic comparison of 12 models, provides a reproducible baseline that can guide future model development. The observed gap between GPT-4V and both weaker models and humans underscores the value of such targeted benchmarks.
major comments (2)
- [§3] §3 (Dataset Construction): The paper supplies only high-level descriptions of the three newly created datasets (IQTest, FunctionQA, PaperQA). No quantitative diversity statistics, pre-registered inclusion criteria, inter-annotator agreement figures, or explicit filtering rules are reported. Because the headline result (GPT-4V at 49.9 % vs. Bard at ~34.8 %) rests on the claim that these 6,141 examples constitute an unbiased sample of visual mathematical reasoning, the absence of these details is load-bearing for interpreting the 15.1-point margin.
- [§4.2] §4.2 (Evaluation Protocol): The scoring procedure for open-ended and free-form answers is not fully specified (e.g., exact matching rules, use of LLM judges, handling of partial credit). Ambiguities here could systematically affect accuracy numbers across models and thereby the reported performance gaps.
minor comments (2)
- [Table 1] Table 1 and Figure 2: The breakdown of example counts and task categories across the 28 existing datasets plus the three new ones would benefit from an explicit column showing the proportion of each source that survived filtering.
- [Abstract] Abstract: The sentence stating that GPT-4V 'falls short of human performance by 10.4%' should be accompanied by a brief parenthetical on how human accuracy was measured (number of annotators, time limits, etc.).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We appreciate the emphasis on improving the transparency of dataset construction and evaluation details, which will strengthen the paper. We address each major comment below and describe the planned revisions.
read point-by-point responses
-
Referee: [§3] §3 (Dataset Construction): The paper supplies only high-level descriptions of the three newly created datasets (IQTest, FunctionQA, PaperQA). No quantitative diversity statistics, pre-registered inclusion criteria, inter-annotator agreement figures, or explicit filtering rules are reported. Because the headline result (GPT-4V at 49.9 % vs. Bard at ~34.8 %) rests on the claim that these 6,141 examples constitute an unbiased sample of visual mathematical reasoning, the absence of these details is load-bearing for interpreting the 15.1-point margin.
Authors: We acknowledge that Section 3 currently offers primarily high-level descriptions of IQTest, FunctionQA, and PaperQA. In the revised manuscript we will expand this section with quantitative diversity statistics (e.g., topic distributions across algebra, geometry, and statistics; counts of visual complexity levels) and explicit statements of the inclusion criteria and filtering rules applied during curation. Because these three datasets were created internally by the author team without a pre-registered protocol or multiple independent annotators, inter-annotator agreement figures are not available and cannot be supplied. We will add a brief limitations paragraph clarifying this point and its implications for claims of representativeness. revision: partial
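As an illustration of the kind of diversity statistics the rebuttal promises, a minimal sketch that tabulates topic and visual-context shares over a curated example set; the `topic` and `context` field names are assumed for illustration and are not MathVista's actual metadata schema:

```python
from collections import Counter

def diversity_stats(examples):
    """Tabulate topic and visual-context shares over curated examples.

    `examples` is assumed to be a list of dicts with 'topic' and 'context'
    keys; these field names are illustrative, not MathVista's real schema.
    """
    total = len(examples)
    topics = Counter(ex["topic"] for ex in examples)
    contexts = Counter(ex["context"] for ex in examples)
    return {
        "topic_share": {t: n / total for t, n in topics.items()},
        "context_share": {c: n / total for c, n in contexts.items()},
    }
```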
-
Referee: [§4.2] §4.2 (Evaluation Protocol): The scoring procedure for open-ended and free-form answers is not fully specified (e.g., exact matching rules, use of LLM judges, handling of partial credit). Ambiguities here could systematically affect accuracy numbers across models and thereby the reported performance gaps.
Authors: We agree that the current description of the scoring protocol in §4.2 leaves room for ambiguity. In the revised version we will provide a detailed specification of the exact matching rules (including normalization steps for units, formatting, and synonyms), confirm that no LLM judges were used for the main results (scoring relied on rule-based matching supplemented by human verification for edge cases), and clarify the policy on partial credit and multiple valid answers. These additions will improve reproducibility and allow readers to better assess the reported gaps, such as the 15.1-point margin between GPT-4V and Bard. revision: yes
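The exact rules are deferred to the revision; a rough sketch of what a rule-based matcher of the kind described could look like (case and comma normalization, trailing-unit stripping, numeric tolerance). The unit list and tolerance below are illustrative assumptions, not the benchmark's published protocol:

```python
import re

# Illustrative trailing units; the benchmark's actual normalization rules
# are to be specified in the revised paper, not reproduced here.
UNIT_SUFFIX = re.compile(r"\s*(cm|mm|km|kg|degrees|deg|°|%)\s*$")

def normalize(answer: str) -> str:
    """Lowercase, drop commas, collapse whitespace, strip a trailing unit."""
    s = answer.strip().lower().replace(",", "")
    s = re.sub(r"\s+", " ", s)
    return UNIT_SUFFIX.sub("", s).strip()

def is_correct(pred: str, gold: str, tol: float = 1e-2) -> bool:
    """Numeric comparison with a small tolerance when both answers parse
    as numbers; otherwise exact string match after normalization."""
    p, g = normalize(pred), normalize(gold)
    try:
        return abs(float(p) - float(g)) <= tol
    except ValueError:
        return p == g
```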
Circularity Check
No circularity: direct empirical evaluation on assembled benchmark
full rationale
The paper constructs MathVista by merging 28 existing multimodal math datasets with three newly authored ones (IQTest, FunctionQA, PaperQA) and then reports model accuracies obtained by direct, independent inference on the resulting 6,141 examples. No equations, fitted parameters, or first-principles derivations appear; the headline numbers (GPT-4V at 49.9 %) are raw test-set measurements, not quantities that reduce to the curation process by construction. Self-citations to prior datasets are ordinary sourcing and do not bear the load of any claimed prediction or uniqueness theorem. The evaluation chain is therefore self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning.
Forward citations
Cited by 60 Pith papers
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
-
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration
PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines o...
-
Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
-
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
-
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
-
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication, with performance predicted by an arithmetic load metric C and degradation confirmed as computational rather than pe...
-
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
-
Boosting Visual Instruction Tuning with Self-Supervised Guidance
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
-
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
SVSR trains multimodal models to verify and correct their own reasoning using a preference dataset, supervised fine-tuning, and semi-online DPO with a teacher model.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Too long; didn't solve
Longer prompts and solutions in a new expert-authored math dataset correlate with higher failure rates across LLMs, with length linked to empirical difficulty after difficulty adjustment.
-
Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpr...
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
Reference graph
Works this paper leans on
-
[1]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
-
[2]
InstructBLIP: Towards general-purpose vision-language models with instruction tuning
-
[3]
VizWiz grand challenge: Answering visual questions from blind people
-
[4]
C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models
-
[5]
Multimodal foundation models: From specialists to general-purpose assistants
-
[6]
Visual instruction tuning