pith. machine review for the scientific record.

arxiv: 2403.14624 · v2 · submitted 2024-03-21 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:22 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords multi-modal LLMs · visual math problems · diagram understanding · benchmark evaluation · chain-of-thought reasoning · mathematical reasoning

The pith

MathVerse shows multi-modal LLMs often solve visual math problems using text rather than diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing visual math benchmarks give away too much information in the question text, allowing multi-modal LLMs to answer correctly without understanding the diagrams. To fix this, it presents MathVerse, built from 2,612 real math problems that human annotators rewrite into six versions, each with a different balance of textual and visual detail. This creates a gradient from fully text-supported problems to diagram-only cases, so that performance drops can be attributed to weak diagram understanding rather than to missing textual hints. A new evaluation method scores the quality of reasoning steps in the model's chain of thought instead of only checking the final answer. Readers should care because it offers a clearer way to measure and improve how well these models combine visual and mathematical information.

Core claim

MathVerse collects 2,612 high-quality math problems with diagrams from public sources and transforms each into six versions with varying multi-modal information content for a total of 15K samples. This design enables an equitable evaluation of whether MLLMs truly understand visual diagrams when solving math problems. The benchmark further includes a Chain-of-Thought evaluation strategy that uses GPT-4(V) to extract key reasoning steps and provide detailed error analysis on intermediate outputs.
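
As a concrete illustration of the Chain-of-Thought scoring idea, the sketch below aggregates per-step verdicts into a multi-step score and blends it with a final-answer check. This is a minimal Python sketch under assumptions: the judge interface, field names, and the equal weighting between step quality and final answer are illustrative, not the paper's released implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepJudgement:
    """One reasoning step extracted by the judge model (e.g. GPT-4(V))."""
    text: str
    correct: bool  # judge's verdict on this intermediate step

def cot_evaluation_score(steps: List[StepJudgement],
                         final_answer_correct: bool,
                         answer_weight: float = 0.5) -> float:
    """Blend the fraction of correct steps with the final-answer check.

    The multi-step score is the share of extracted steps judged correct;
    the 50/50 blend with the final answer is an assumption made for
    illustration, not the paper's published formula.
    """
    multi_step = sum(s.correct for s in steps) / len(steps) if steps else 0.0
    return (1 - answer_weight) * multi_step + answer_weight * float(final_answer_correct)

# Example: two of three extracted steps judged correct, final answer wrong.
steps = [StepJudgement("identify the similar triangles", True),
         StepJudgement("set up the ratio of corresponding sides", True),
         StepJudgement("solve for the unknown side", False)]
print(round(cot_evaluation_score(steps, final_answer_correct=False), 2))  # -> 0.33
```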

What carries the argument

The multi-version problem transformation that systematically varies the amount of textual information provided alongside the diagram to measure reliance on visual input.
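
A minimal sketch of the diagnostic this design enables, using six placeholder labels for the paper's six per-problem variants; the labels, the accuracies, and the gap statistic below are illustrative assumptions rather than the official evaluation protocol.

```python
from typing import Dict

# Placeholder labels for the six per-problem variants; the real benchmark
# defines its own named versions with varying textual and visual support.
VERSIONS = ["text_dominant", "text_lite", "text_only",
            "vision_intensive", "vision_dominant", "vision_only"]

def text_reliance_gap(accuracy_by_version: Dict[str, float]) -> float:
    """Accuracy gap between a text-rich variant and the diagram-only variant.

    A large positive gap suggests the model answers from the text rather
    than by reading the diagram. Purely illustrative diagnostic.
    """
    return accuracy_by_version["text_dominant"] - accuracy_by_version["vision_only"]

# Hypothetical per-version accuracies for one model (not reported numbers).
acc = {"text_dominant": 0.54, "text_lite": 0.47, "text_only": 0.45,
       "vision_intensive": 0.40, "vision_dominant": 0.36, "vision_only": 0.22}
print(f"text-reliance gap: {text_reliance_gap(acc):.2f}")  # -> 0.32
```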

Load-bearing premise

The six versions of each problem preserve the original mathematical intent and difficulty while only changing the distribution of information between text and diagram.

What would settle it

If MLLM accuracy remains consistent even on the versions with the least textual information and greatest dependence on the diagram, the benchmark would fail to show that models need to interpret visuals.

read the original abstract

The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing visual math benchmarks embed excessive textual information in questions, enabling MLLMs to deduce answers without genuinely interpreting diagrams. To enable equitable evaluation, the authors collect 2,612 high-quality multi-subject math problems with diagrams from public sources and have human annotators transform each into six versions that vary the amount of multi-modal information (textual vs. visual), yielding 15K test samples total. They further propose a Chain-of-Thought evaluation that uses GPT-4(V) to extract key reasoning steps and score them with error analysis rather than binary correctness.

Significance. If the version transformations are shown to preserve mathematical semantics and difficulty, MathVerse would provide a valuable, fine-grained benchmark for isolating and measuring visual diagram understanding in MLLMs. The multi-version design and step-level GPT-4 scoring could become a standard protocol for diagnosing whether models truly integrate visual and textual cues in mathematical reasoning, directly informing future architecture and training improvements.

major comments (2)
  1. [§3.2] Human Annotation and Version Transformation: The manuscript describes the process of transforming each problem into six versions with differing textual/visual content but reports no quantitative validation—such as inter-annotator agreement scores, expert equivalence ratings, or solve-rate consistency on a held-out set—to confirm that core problem semantics, difficulty, and mathematical equivalence are preserved. Without these checks, performance gaps across versions could reflect annotation artifacts rather than genuine differences in visual understanding.
  2. [§4.3] CoT Evaluation with GPT-4(V): The adaptive step-extraction and scoring procedure is presented as enabling fine-grained assessment, yet the paper provides insufficient detail on prompt templates, exact scoring rubrics, or any human-GPT agreement study. This weakens the claim that the method reliably reveals intermediate reasoning quality, as GPT-4(V) errors could systematically bias the reported insights.
minor comments (2)
  1. [§3.1] The selection criteria and filtering steps used to arrive at the final 2,612 problems from public sources are only briefly summarized; a table or paragraph detailing subject distribution, diagram complexity, and exclusion reasons would improve reproducibility.
  2. In the example figures illustrating the six versions, the visual differences between versions could be highlighted with explicit callouts or color coding to make the information-content gradient immediately clear to readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights key areas for strengthening the validation and transparency of MathVerse. We address each major comment below and will incorporate the necessary additions in the revised manuscript.

read point-by-point responses
  1. Referee: [§3.2] Human Annotation and Version Transformation: The manuscript describes the process of transforming each problem into six versions with differing textual/visual content but reports no quantitative validation—such as inter-annotator agreement scores, expert equivalence ratings, or solve-rate consistency on a held-out set—to confirm that core problem semantics, difficulty, and mathematical equivalence are preserved. Without these checks, performance gaps across versions could reflect annotation artifacts rather than genuine differences in visual understanding.

    Authors: We agree that quantitative validation would strengthen confidence in the version transformations. The manuscript details the annotation guidelines and process used by human annotators to derive the six versions through controlled, incremental removal of textual or visual information while aiming to preserve mathematical semantics and difficulty. However, we did not report inter-annotator agreement or equivalence metrics. In the revision, we will add inter-annotator agreement scores on a multi-annotated subset, expert equivalence ratings, and solve-rate consistency analysis on a held-out set to empirically confirm preservation of core problem properties. revision: yes

  2. Referee: [§4.3] CoT Evaluation with GPT-4(V): The adaptive step-extraction and scoring procedure is presented as enabling fine-grained assessment, yet the paper provides insufficient detail on prompt templates, exact scoring rubrics, or any human-GPT agreement study. This weakens the claim that the method reliably reveals intermediate reasoning quality, as GPT-4(V) errors could systematically bias the reported insights.

    Authors: We concur that greater detail on the evaluation procedure is warranted to support its reliability. The manuscript describes the adaptive use of GPT-4(V) for step extraction and error-annotated scoring as an alternative to binary correctness. In the revised version, we will include the full prompt templates, the precise scoring rubrics with error categories, and a human-GPT agreement study on a sampled set of responses to quantify alignment and address potential biases in the automated assessment. revision: yes
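
The agreement studies proposed in the two responses above (inter-annotator agreement on the version transformations, and human versus GPT-4(V) agreement on step scoring) could be summarized with a chance-corrected statistic. Below is a minimal Python sketch of Cohen's kappa on hypothetical equivalence labels; the label set and data are invented for illustration only.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and len(rater_a) > 0
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(freq_a) | set(freq_b))
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)

# Hypothetical equivalence ratings on a doubly-annotated subset of versions.
annotator_1 = ["equivalent", "equivalent", "not_equivalent", "equivalent", "equivalent"]
annotator_2 = ["equivalent", "equivalent", "not_equivalent", "not_equivalent", "equivalent"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")  # -> 0.55
```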

Circularity Check

0 steps flagged

No circularity: benchmark construction is externally grounded

full rationale

The paper collects 2,612 math problems from publicly available sources and applies explicit human annotation to produce six modality variants, yielding 15K samples. No equations, fitted parameters, or model predictions are defined; the evaluation strategy (CoT extraction via GPT-4(V)) operates on external test samples rather than quantities derived from the paper's own outputs. The central claim—that performance differences isolate visual understanding—rests on the annotation process itself, which is described as an independent construction step without self-referential reduction or load-bearing self-citations. The benchmark's construction is therefore grounded in external sources, with no self-referential dependencies.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on the assumption that human annotators can reliably create information-controlled variants without altering core mathematical content or introducing unintended cues. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Human annotators can produce six versions of each problem that differ only in the amount of visual versus textual information while preserving mathematical equivalence.
    Invoked in the description of transforming each of the 2,612 problems into six distinct versions.
  • domain assumption GPT-4(V) can accurately extract and score individual reasoning steps from MLLM outputs for error analysis.
    Used in the proposed Chain-of-Thought evaluation strategy.

pith-pipeline@v0.9.0 · 5608 in / 1409 out tokens · 33382 ms · 2026-05-17T01:22:47.288047+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  3. Structured Role-Aware Policy Optimization for Multimodal Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified v...

  4. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  5. Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

    cs.CV 2026-01 unverdicted novelty 7.0

    GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.

  6. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    cs.CL 2024-10 conditional novelty 7.0

    Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

  7. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  8. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    cs.AI 2024-07 accept novelty 7.0

    WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

  9. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

  10. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  11. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI 2026-05 unverdicted novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

  12. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  13. VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

    cs.CV 2026-04 unverdicted novelty 6.0

    VLMs bypass visual comparison by recovering semantic labels for nameable entities and hallucinate on unnamable ones, as shown by performance gaps and Logit Lens analysis.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  16. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL 2024-11 conditional novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  17. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  18. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 18 Pith papers · 26 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems 35, 23716–23736 (2022)

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)

  2. [2]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y ., Hajishirzi, H.: Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319 (2019)

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

  4. [4]

    In: Advances in neural information processing systems

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: Advances in neural information processing systems. pp. 1877–1901 (2020)

  5. [5]

    In: Proceedings of the 29th International Conference on Computational Linguistics

    Cao, J., Xiao, J.: An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In: Proceedings of the 29th International Conference on Computational Linguistics. pp. 1511–1520 (2022)

  6. [6]

    arXiv preprint arXiv:2305.13292 (2023)

    Chen, G., Zheng, Y .D., Wang, J., Xu, J., Huang, Y ., Pan, J., Wang, Y ., Wang, Y ., Qiao, Y ., Lu, T., et al.: Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292 (2023)

  7. [8]

    ArXiv abs/2212.02746 (2022)

    Chen, J., Li, T., Qin, J., Lu, P., Lin, L., Chen, C., Liang, X.: Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. ArXiv abs/2212.02746 (2022)

  8. [10]

    ArXiv abs/2105.14517 (2021), https://api.semanticscholar.org/CorpusID:235253782

    Chen, J., Tang, J., Qin, J., Liang, X., Liu, L., Xing, E.P., Lin, L.: Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. ArXiv abs/2105.14517 (2021), https://api.semanticscholar.org/CorpusID:235253782

  9. [11]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Chen, J., Li, D.Z.X.S.X., Zhang, Z.L.P., Xiong, R.K.V .C.Y ., Elhoseiny, M.: Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)

  10. [12]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Chen, L., Li, J., wen Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D.: Sharegpt4v: Improving large multi-modal models with better captions. ArXiv abs/2311.12793 (2023), https://api.semanticscholar.org/CorpusID:265308687

  11. [13]

    https://lmsys.org/blog/2023-03-30-vicuna/ (March 2023)

    Chiang, W.L., Li, Z., Lin, Z., Sheng, Y ., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y ., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/ (March 2023)

  12. [14]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  13. [15]

    Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)

  14. [16]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Dong, X., Zhang, P., Zang, Y ., Cao, Y ., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., et al.: Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420 (2024)

  15. [17]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y ., Ji, R.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  16. [18]

    A challenger to gpt-4v? early explorations of gemini in visual expertise

    Fu, C., Zhang, R., Lin, H., Wang, Z., Gao, T., Luo, Y ., Huang, Y ., Zhang, Z., Qiu, L., Ye, G., et al.: A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436 (2023)

  17. [19]

    arXiv preprint arXiv:2312.11370 (2023)

    Gao, J., Pi, R., Zhang, J., Ye, J., Zhong, W., Wang, Y ., Hong, L., Han, J., Xu, H., Li, Z., et al.: G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370 (2023)

  18. [20]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., Qiao, Y .: Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010 (2023)

  19. [21]

    arXiv preprint arXiv:2402.05935 (2024)

    Gao, P., Zhang, R., Liu, C., Qiu, L., Huang, S., Lin, W., Zhao, S., Geng, S., Lin, Z., Jin, P., et al.: Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935 (2024)

  20. [22]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, G.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  21. [23]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following

    Guo, Z., Zhang, R., Zhu, X., Tang, Y ., Ma, X., Han, J., Chen, K., Gao, P., Li, X., Li, H., et al.: Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615 (2023)

  22. [24]

    Imagebind-llm: Multi-modality instruction tuning

    Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H., Zhang, K., Liu, C., Wen, S., Guo, Z., et al.: Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905 (2023)

  23. [25]

    Proceedings of the International Conference on Learning Representations (ICLR) (2021)

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021)

  24. [26]

    NeurIPS (2021)

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J.: Measuring mathematical problem solving with the math dataset. NeurIPS (2021)

  25. [27]

    Advances in Neural Information Processing Systems 36 (2024)

    Hong, Y ., Zhen, H., Chen, P., Zheng, S., Du, Y ., Chen, Z., Gan, C.: 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems 36 (2024)

  26. [28]

    Mixtral of Experts

    Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de Las Casas, D., Hanna, E.B., Bressand, F., Lengyel, G., Bour, G., Lample, G., Lavaud, L.R., Saulnier, L., Lachaux, M., Stock, P., Subramanian, S., Yang, S., Antoniak, S., Scao, T.L., Gervet, T., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mixtral of experts....

  27. [29]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y ., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)

  28. [30]

    https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/ (2024)

    Li, B., Zhang, K., Zhang, H., Guo, D., Zhang, R., Li, F., Zhang, Y ., Liu, Z., Li, C.: Llava-next: Stronger llms supercharge multimodal capabilities in the wild. https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/ (2024)

  29. [31]

    MIMIC-IT: Multi-Modal In-Context Instruction Tuning

    Li, B., Zhang, Y ., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)

  30. [32]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y ., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y ., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

  31. [33]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Li, B., Wang, R., Wang, G., Ge, Y ., Ge, Y ., Shan, Y .: Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv abs/2307.16125 (2023)

  32. [34]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Li, F., Zhang, R., Zhang, H., Zhang, Y ., Li, B., Li, W., Ma, Z., Li, C.: Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895 (2024)

  33. [35]

    In: International Conference on Machine Learning

    Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)

  34. [36]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)

  35. [37]

    Liu, H., Li, C., Li, Y ., Lee, Y .J.: Improved baselines with visual instruction tuning (2023)

  36. [38]

    Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., Lee, Y .J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/

  37. [39]

    In: NeurIPS (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y .J.: Visual instruction tuning. In: NeurIPS (2023)

  38. [40]

    Liu, Y ., Duan, H., Zhang, Y ., Li, B., Zhang, S., Zhao, W., Yuan, Y ., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)

  39. [41]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Lu, P., Bansal, H., Xia, T., Liu, J., yue Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: Mathvista: Evaluating math reasoning in visual contexts with gpt-4v, bard, and other large multimodal models. ArXiv abs/2310.02255 (2023)

  40. [42]

    Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning. arXiv preprint arXiv:2105.04165 (2021)

    Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165 (2021)

  41. [43]

    In: Annual Meeting of the Association for Computational Linguistics (2021)

    Lu, P., Gong, R., Jiang, S., Qiu, L., Huang, S., Liang, X., Zhu, S.C.: Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. In: Annual Meeting of the Association for Computational Linguistics (2021), https://api.semanticscholar.org/CorpusID:234337054

  42. [44]

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

    Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., Zhang, D.: Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583 (2023)

  43. [45]

    https://chat.openai.com (2023)

    OpenAI: Chatgpt. https://chat.openai.com (2023)

  44. [46]

    GPT-4 Technical Report

    OpenAI: Gpt-4 technical report. ArXiv abs/2303.08774 (2023)

  45. [47]

    OpenAI: GPT-4V(ision) system card (2023), https://openai.com/research/gpt-4v-system-card

  46. [48]

    In: Advances in Neural Information Processing Systems (2022)

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Gray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R.: Training language models to follow instructions with human feedback. In: Advances in Neural Information Processing...

  47. [49]

    In: International Conference on Machine Learning (2021), https://api.semanticscholar.org/CorpusID:231591445

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021), https://api.semanticscholar.org/CorpusID:231591445

  48. [50]

    Solving General Arithmetic Word Problems

    Roy, S., Roth, D.: Solving general arithmetic word problems. ArXiv abs/1608.01413 (2016), https://api.semanticscholar.org/CorpusID:560565

  49. [51]

    In: Proceedings of the 2015 conference on empirical methods in natural language processing

    Seo, M., Hajishirzi, H., Farhadi, A., Etzioni, O., Malcolm, C.: Solving geometry problems: Combining text and diagram interpretation. In: Proceedings of the 2015 conference on empirical methods in natural language processing. pp. 1466–1476 (2015)

  50. [52]

    PandaGPT: One Model To Instruction-Follow Them All

    Su, Y ., Lan, T., Li, H., Xu, J., Wang, Y ., Cai, D.: Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355 (2023)

  51. [53]

    Advances in Neural Information Processing Systems 36 (2024)

    Sun, K., Pan, J., Ge, Y ., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y ., et al.: Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems 36 (2024)

  52. [54]

    Team, I.: Internlm: A multilingual language model with progressively enhanced capabilities (2023)

  53. [55]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  54. [56]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  55. [57]

    In: The Twelfth International Conference on Learning Representations (2024)

    Wang, K., Ren, H., Zhou, A., Lu, Z., Luo, S., Shi, W., Zhang, R., Song, L., Zhan, M., Li, H.: Mathcoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=z8TW0ttBPp

  56. [58]

    Advances in Neural Information Processing Systems 35, 24824–24837 (2022)

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V ., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)

  57. [59]

    arXiv preprint arXiv:2306.09265 (2023)

    Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y ., Luo, P.: Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265 (2023)

  58. [60]

    Pointllm: Empowering large language models to understand point clouds

    Xu, R., Wang, X., Wang, T., Chen, Y ., Pang, J., Lin, D.: Pointllm: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)

  59. [61]

    Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y ., Wang, J., Hu, A., Shi, P., Shi, Y ., Jiang, C., Li, C., Xu, Y ., Chen, H., Tian, J., Qian, Q., Zhang, J., Huang, F.: mplug-owl: Modularization empowers large language models with multimodality (2023)

  60. [62]

    Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration (2023)

  61. [63]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Yue, X., Ni, Y ., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y ., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y ., Huang, W., Sun, H., Su, Y ., Chen, W.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502 (2023)

  62. [64]

    arXiv preprint arXiv:2309.05653 (2023)

    Yue, X., Qu, X., Zhang, G., Fu, Y ., Huang, W., Sun, H., Su, Y ., Chen, W.: Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653 (2023)

  63. [65]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)

  64. [66]

    In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=d4UiXAHN2W

    Zhang, R., Han, J., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., Gao, P., Qiao, Y .: LLaMA-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=d4UiXAHN2W

  65. [67]

    CVPR 2023 (2023)

    Zhang, R., Hu, X., Li, B., Huang, S., Deng, H., Li, H., Qiao, Y ., Gao, P.: Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. CVPR 2023 (2023)

  66. [68]

    ICLR 2024 (2023)

    Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Gao, P., Li, H.: Personalize segment anything model with one shot. ICLR 2024 (2023)

  67. [69]

    CVPR 2023 (2023)

    Zhang, R., Wang, L., Qiao, Y ., Gao, P., Li, H.: Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. CVPR 2023 (2023)

  68. [70]

    arXiv preprint arXiv:2407.08739 (2024)

    Zhang, R., Wei, X., Jiang, D., Zhang, Y ., Guo, Z., Tong, C., Liu, J., Zhou, A., Wei, B., Zhang, S., et al.: Mavis: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739 (2024)

  69. [71]

    Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification

    Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., et al.: Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921 (2023)

  70. [72]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  71.–80. [73]–[82] Extraction artifacts rather than bibliographic references: these entries contain response text from the appendix's Chain-of-Thought evaluation figures (e.g. Figure 26, a response comparison of GPT-4V, LLaVA-NeXT, and SPHINX-MoE on a Text-Lite problem, showing key-step extraction and per-step scoring).

Showing first 80 references.