
arxiv: 2503.16549 · v2 · submitted 2025-03-19 · 💻 cs.CV

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

Pith reviewed 2026-05-22 22:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual mathematical reasoning · diagram perception · perception-inference decoupling · FlowVerse benchmark

The pith

Decoupling perception into a separate trained stage improves MLLM accuracy on diagram-based math problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multimodal large language models still fail to reliably extract information from diagrams in math problems, even when they can reason once the facts are given. It introduces the FlowVerse benchmark to measure perception and reasoning separately and reveals clear gaps in current models. In response it proposes MathFlow, a pipeline that splits the task into an independent perception stage followed by inference. Training MathFlow-P-7B specifically for perception produces measurable gains when the model is paired with existing closed- and open-source reasoners. A reader would care because the separation lets each component be improved without retraining the entire system.

Core claim

MathFlow is a modular pipeline that decouples perception of diagrams from subsequent inference. Training a dedicated perception model, MathFlow-P-7B, and feeding its output to various inference models produces substantial performance gains on visual mathematical tasks.
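
The decoupling described above can be pictured as two functions composed in sequence. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Problem`, `solve`, `stub_perceive`, and `stub_infer` are hypothetical, and the stubs stand in for a perception model such as MathFlow-P-7B and for whichever inference model is paired with it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Problem:
    question: str   # text of the problem
    diagram: bytes  # raw image bytes (placeholder in this sketch)


# Hypothetical stage signatures; the paper's real interfaces may differ.
PerceptionFn = Callable[[Problem], str]
InferenceFn = Callable[[str, str], str]


def solve(problem: Problem, perceive: PerceptionFn, infer: InferenceFn) -> str:
    """Run perception first, then hand its text output to the inference stage."""
    perceived = perceive(problem)
    return infer(problem.question, perceived)


def stub_perceive(problem: Problem) -> str:
    # Stand-in for a trained perception model; returns fixed text.
    return "Square ABCD has side length 2."


def stub_infer(question: str, perceived: str) -> str:
    # Stand-in for any closed- or open-source reasoner.
    return f"Q: {question} | facts: {perceived}"


answer = solve(Problem("Find the area.", b""), stub_perceive, stub_infer)
```

Because `solve` only depends on the two callables, swapping the inference model means passing a different `infer` function while the perception stage stays unchanged.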

What carries the argument

The MathFlow pipeline that isolates diagram perception as an independent, trainable stage before handing structured information to an inference model.

If this is right

  • Perception can be optimized independently of the reasoning component.
  • The same perception model works with multiple closed-source and open-source inference systems.
  • FlowVerse supplies separate scores for perception accuracy and reasoning accuracy.
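
One way to picture separate perception and reasoning scores is sketched below. This is an illustrative assumption rather than FlowVerse's actual scoring protocol: `GradedItem` and `separate_scores` are hypothetical names, and reasoning accuracy is defined here as answer accuracy restricted to items whose perception was graded correct.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GradedItem:
    perception_correct: bool  # did the model extract the diagram facts correctly?
    answer_correct: bool      # was the final answer correct?


def separate_scores(items: List[GradedItem]) -> Dict[str, float]:
    """Return perception accuracy and reasoning accuracy as two numbers."""
    if not items:
        raise ValueError("need at least one graded item")
    perceived = [it for it in items if it.perception_correct]
    perception_acc = len(perceived) / len(items)
    # Assumption: reasoning is only scored where perception succeeded.
    if perceived:
        reasoning_acc = sum(it.answer_correct for it in perceived) / len(perceived)
    else:
        reasoning_acc = 0.0
    return {"perception": perception_acc, "reasoning": reasoning_acc}


scores = separate_scores([
    GradedItem(True, True),
    GradedItem(True, False),
    GradedItem(False, False),
    GradedItem(True, True),
])
```

With the toy data above, three of four items pass perception and two of those three are answered correctly, so the two scores diverge and each stage can be diagnosed on its own.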

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation might help other visual domains where diagrams or charts are central.
  • Specialized perception models could be developed for different diagram styles or subjects.
  • The benchmark could be used to diagnose exactly which visual features current models miss.

Load-bearing premise

That training perception separately will reliably improve end-to-end results without introducing new interface errors between the two stages.

What would settle it

An experiment in which MathFlow-P-7B is paired with the same inference models on FlowVerse and yields no accuracy increase over the baseline MLLM.
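
The comparison described above reduces to an accuracy difference on the same problems. The sketch below uses made-up toy predictions, not results from the paper; `accuracy` and `gain` are hypothetical helper names.

```python
from typing import List


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def gain(baseline_preds: List[str], pipeline_preds: List[str], golds: List[str]) -> float:
    """Accuracy of the two-stage pipeline minus accuracy of the baseline MLLM."""
    return accuracy(pipeline_preds, golds) - accuracy(baseline_preds, golds)


# Toy data for illustration only.
golds = ["A", "B", "C", "D"]
baseline = ["A", "C", "C", "A"]   # 2 of 4 correct
pipeline = ["A", "B", "C", "A"]   # 3 of 4 correct
delta = gain(baseline, pipeline, golds)
```

A `delta` at or below zero across the paired inference models would be the outcome that undercuts the claim; a real test would also need enough problems to separate a true gain from noise.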

Figures

Figures reproduced from arXiv: 2503.16549 by Hangjie Yuan, Jun Cen, Pengwei Liu, Shuhang Chen, Tao Feng, Yi Yang, Yunqiu Xu, Zeying Huang.

Figure 1
Figure 1: The Typical Process of Humans Solving Visual Mathematical Problems. We can summarize two key capabilities observed in the typical human problem-solving process: perception and inference. The perception capability involves extracting relevant information from both visual and textual inputs, ensuring accurate reasoning, which inspired the development of FlowVerse and MathFlow. els (MLLMs) [10, 30, 51] are…
Figure 2
Figure 2: Six Versions of Problems in FlowVerse. FlowVerse begins by categorizing the original problem information into four distinct components: Descriptive Information (DI), Essential Information (EI), Only Question (OQ), and Reasoned Property (RP). The first three components are derived directly from the original problem statement, while RP is extracted from the solution and represents the inferences needed to so…
Figure 4
Figure 4: The FlowVerse-CoT-E Strategy. abilities beyond end-to-end performance. In contrast, MathVista [40] integrates diverse mathematical and visual tasks to challenge models with fine-grained visual understanding and compositional reasoning across various contexts. To provide a more comprehensive and in-depth evaluation of MLLMs, MathVerse [79] builds upon MathVista, focusing on eliminating textual redundancy…
Figure 5
Figure 5: Comparison of Two Different CoT Evaluation Performances on FlowVerse†. Visual Perception Error Question: Answer: 1 To find the area of the shaded region in the square ABCD, we first note that the square has a side length of 2. The area of the entire square is therefore 2² = 4 square units. The line through O (the center of the square) that intersects AD and BC at points E and F respectively, divides the…
Figure 6
Figure 6: Problem-solving Comparison of MathFlow⋆ GPT-4V and GPT-4V. GPT-4V. In terms of CoT-based evaluation (CoT-E), MathFlow⋆ GPT-4V also demonstrates consistent superiority. On the other hand, Tab. 3 shows the performance comparison of MLLMs on the FlowVerse† and MathVerse datasets, where FlowVerse† refers to the raw, unmodified version of FlowVerse. Notably, MathFlow⋆ Gemini1.5-pro achieves the highest accurac…
Figure 8
Figure 8: Manual Modification for EI in FlowVerse. For the original problems shown, we transfer some of the EI from diagrams to question texts (highlighted in green) to mark the Vision Centric version. A.2. Subject and Subfield Definition Plane Geometry. This foundational area studies the properties and relationships of points, lines, and surfaces within a two-dimensional plane. It covers key concepts such as circ…
Figure 9
Figure 9: Subject Distribution of FlowVerse. [Plot text: "Percentage", "Distribution of Question Lengths (Word)"; legend: Text Dominant, Text Lite, Vision Dominant, and an average for each.]
Figure 10
Figure 10: Distribution of Question Length for Four Problem Versions. We present the distribution of question length for the four problem versions, with the horizontal axis representing question length in characters and the vertical axis depicting the corresponding probability distribution. we observe a clear downward trend in both the distribution of question lengths and their average values. A.4. Details of Eval…
Figure 11
Figure 11: Visualization of Different Error Type across Different Versions using GPT-4 on FlowVerse. The horizontal axis represents different problem versions, while the vertical axis indicates the error types. The radius of each bubble corresponds to the number of visual perception errors, with smaller radii indicating fewer visual perception errors. rors, underscoring the importance of a balanced information re…
Figure 12
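Figure 12: Comparison of Six Problem Versions in FlowVerse. Components: Descriptive Information, Essential Information, Only Question, Reasoned Property. Versions: Text Centric, Text Limited, Text Plus, Vision Dense, Vision Centric, Vision Primary. Example (translated from Chinese): The class monitor tallied the donations and drew a statistical chart. From the chart: 6 people donated 10 yuan, 13 people donated 20 yuan, 20 people donated 30 yuan, 8 people donated 50 yuan, and 3 people donated 100 yuan. The number of people who donated more than 30 yuan is 8 + 3 = 11. The total donated is 6×10 + 13×20 + 20×30 + 8×50 + 3×100 = 1620 yuan. Which of the following statements is incorrect? ( ) Choices: A. "20 people donated 30 yuan", B. …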
Figure 13
Figure 13: Comparison of Six Problem Versions in FlowVerse.
Figure 14
Figure 14: Comparison of Six Problem Versions in FlowVerse. According to the properties of congruent triangles, we obtain ∠ACB = ∠DFE, and then, by the criterion for parallel lines, we can reach the conclusion: ∵ △ABC ≌ △DEF, ∴ ∠ACB = ∠DFE, ∴ AC ∥ DF. Solving-problem Solution Step Extraction Question: As shown in the figure, △ABC ≌ △DEF, what is the relationship between AC and DF? Visual math problems Answer: AC ∥ D…
Figure 15
Figure 15: Data Annotation of the MathFlow-RP. We first employ Qwen2.5-72B to extract the corresponding steps from the solving-problem solution, then select step N as the target for prediction. Subsequently, the preceding N-1 steps are provided as input within the prompt, enabling the MLLM to predict the next step based on this sequential context. other hand, correctly identifies and matches corresponding sides and a…
Figure 16
Figure 16: Data Annotation of the MathFlow-SFT. We manually extract the corresponding EI and RP from the solving-problem solution and associated diagram. In this representation, the red-highlighted portions indicate EI, while the blue-highlighted sections represent RP.
Figure 17
Figure 17: Response Comparison of GPT-4V and MathFlow⋆ GPT-4V. Visual Perception Error Question: Answer: 8 From the given information, we can determine the length of chord AC in the circle. 1. We know that △OAB is a right triangle with ∠OAB = 90°, OB = 13 (radius of the circle), and AB = 12. Using the Pythagorean theorem, AO = √(OB² − AB²) = √(13² − 12²) = √(169 − 144) = √25 = 5. 2. It is given that AC = 2AH, AH = √(AO² − HO²) …
Figure 18
Figure 18: Response Comparison of GPT-4V and MathFlow⋆ GPT-4V.
Figure 19
Figure 19: Response Comparison of GPT-4V and MathFlow⋆ GPT-4V.
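
The Figure 15 caption describes building a next-step prediction example: step N is the target and the preceding N-1 steps go into the prompt. A minimal sketch of that construction follows, assuming the steps have already been extracted as a list of strings. The function name `build_next_step_example` and the prompt layout are hypothetical, and the sample steps reuse the congruent-triangle text that appears in the figure captions.

```python
from typing import Dict, List


def build_next_step_example(question: str, steps: List[str], n: int) -> Dict[str, str]:
    """Use steps 1..n-1 as prompt context and step n as the prediction target."""
    if not 1 <= n <= len(steps):
        raise ValueError("n must be between 1 and the number of steps")
    context = steps[: n - 1]
    lines = [f"Question: {question}"]
    lines += [f"Step {i}: {step}" for i, step in enumerate(context, start=1)]
    lines.append(f"Step {n}:")  # the model is asked to continue from here
    return {"prompt": "\n".join(lines), "target": steps[n - 1]}


example = build_next_step_example(
    "What is the relationship between AC and DF?",
    ["△ABC ≌ △DEF", "∠ACB = ∠DFE", "AC ∥ DF"],
    n=3,
)
```

Here the prompt carries the first two steps and the target is the third, so one solved problem with K steps can yield up to K such examples by varying `n`.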
Original abstract

Despite strong results on many tasks, multimodal large language models (MLLMs) still underperform on visual mathematical problem solving, especially in reliably perceiving and interpreting diagrams. Inspired by human problem-solving, we hypothesize that the ability to extract meaningful information from diagrams is pivotal, as it directly conditions subsequent inference. Hence, we introduce FlowVerse, a comprehensive benchmark that provides a fine-grained evaluation of MLLMs' perception and reasoning capabilities. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned properties from diagrams and performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages, thereby optimizing each independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. Project page: https://github.com/MathFlow-zju/MathFlow.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlowVerse, a benchmark providing fine-grained evaluation of MLLMs' perception and reasoning on visual mathematical problems, and MathFlow, a modular pipeline decoupling perception from inference. It trains MathFlow-P-7B as a dedicated perception model and reports that this yields substantial performance gains when paired with various closed- and open-source inference models.

Significance. If the claimed gains hold under scrutiny, the work would provide evidence that separating and specializing the perception stage can address a key bottleneck in MLLM visual math solving, while also supplying a new benchmark for the community.

major comments (2)
  1. [Abstract] The central claim of 'substantial performance gains' from MathFlow-P-7B is stated without any quantitative metrics, dataset sizes, error bars, or ablation results, preventing verification of the effect size or statistical reliability.
  2. [Abstract] The hypothesis that decoupling perception into a separate trained stage will reliably improve end-to-end performance is presented as pivotal, yet the abstract supplies no evidence on whether the perception model was trained on data disjoint from the inference models or benchmarks.
minor comments (1)
  1. [Abstract] The project page link is provided, which supports potential reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'substantial performance gains' from MathFlow-P-7B is stated without any quantitative metrics, dataset sizes, error bars, or ablation results, preventing verification of the effect size or statistical reliability.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the central claim. In the revision we will add specific performance deltas on FlowVerse (with dataset sizes), while retaining the note that full ablations, error bars, and statistical details appear in the experimental sections. revision: yes

  2. Referee: [Abstract] The hypothesis that decoupling perception into a separate trained stage will reliably improve end-to-end performance is presented as pivotal, yet the abstract supplies no evidence on whether the perception model was trained on data disjoint from the inference models or benchmarks.

    Authors: We acknowledge the abstract does not explicitly state data disjointness. We will insert a concise clause confirming that MathFlow-P-7B was trained on data disjoint from both the FlowVerse benchmark and the inference models used in the paired experiments. The full training-data description and separation protocol are already detailed in Sections 3 and 4. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical pipeline: creation of FlowVerse benchmark, observation of MLLM perception limits, introduction of modular MathFlow decoupling perception/inference, training of MathFlow-P-7B, and reporting of integration gains. No equations, fitted parameters, derivations, or uniqueness theorems appear. No self-citation chains or ansatzes are invoked as load-bearing. The performance claims rest on external experimental results rather than any quantity that reduces to its own inputs by construction. This is a standard empirical contribution with self-contained evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, fitted constants, or explicit assumptions beyond the high-level hypothesis; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5755 in / 1065 out tokens · 59090 ms · 2026-05-22T22:49:22.352538+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Models for Operations Research: A Comprehensive Survey

    math.OC 2026-05 unverdicted novelty 2.0

    A survey compiling roles, applications, benchmarks, challenges, and future directions for large language models in operations research.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 1 Pith paper · 29 internal anchors

  1. [1]

    Large language models for mathematical reasoning: Progresses and challenges

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024. 2

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736,

  3. [3]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019. 3

  4. [4]

    claude-3-5-sonnet system card, 2024

    Anthropic. claude-3-5-sonnet system card, 2024. 4, 6, 9

  5. [5]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023. 3

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 4, 5, 6, 9

  7. [7]

    Geogpt4v: Towards geometric multi-modal large language models with geometric image generation

    Shihao Cai, Keqin Bao, Hangyu Guo, Jizhi Zhang, Jun Song, and Bo Zheng. Geogpt4v: Towards geometric multi-modal large language models with geometric image generation. arXiv preprint arXiv:2406.11503, 2024. 6

  8. [8]

    Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517, 2021. 1

  9. [9]

    Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression

    Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746, 2022. 1

  10. [10]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. 1

  11. [11]

    Chatcot: Tool-augmented chain-of-thought reasoning on chat-based large language models

    Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, and Ji-Rong Wen. Chatcot: Tool-augmented chain-of-thought reasoning on chat-based large language models. arXiv preprint arXiv:2305.14323, 2023. 3

  12. [12]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 4, 6

  13. [13]

    How to learn and teach economics with large language models, including gpt

    Tyler Cowen and Alexander T Tabarrok. How to learn and teach economics with large language models, including gpt

  14. [14]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024. 5, 6, 9

  15. [15]

    Visual reasoning and multi-agent approach in multimodal large language models (mllms): Solving tsp and mtsp combinatorial challenges

    Mohammed Elhenawy, Ahmad Abutahoun, Taqwa I Alhadidi, Ahmed Jaber, Huthaifa I Ashqar, Shadi Jaradat, Ahmed Abdelhay, Sebastien Glaser, and Andry Rakotonirainy. Visual reasoning and multi-agent approach in multimodal large language models (mllms): Solving tsp and mtsp combinatorial challenges. arXiv preprint arXiv:2407.00092, 2024. 3

  16. [16]

    More than meets the ai: Evaluating the performance of gpt-4 on computer graphics assessment questions

    Tony Haoran Feng, Paul Denny, Burkhard Wuensche, Andrew Luxton-Reilly, and Steffan Hooper. More than meets the ai: Evaluating the performance of gpt-4 on computer graphics assessment questions. In Proceedings of the 26th Australasian Computing Education Conference, pages 182–191, 2024. 2

  17. [17]

    Gpt-3: Its nature, scope, limits, and consequences

    Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30: 681–694, 2020. 3

  18. [18]

    Mathematical capabilities of chatgpt

    Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt. Advances in neural information processing systems, 36, 2024. 3

  19. [19]

    Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. arXiv preprint arXiv:2410.07985, 2024. 3

  20. [21]

    G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

    Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023. 6

  21. [22]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,

  22. [23]

    Sphinx-x: Scaling data and parameters for a family of multi-modal large language models

    Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. arXiv preprint arXiv:2402.05935, 2024. 5, 6, 9

  23. [24]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 7

  24. [25]

    Infimm-webmath-40b: Advancing multimodal pre-training for enhanced mathematical reasoning

    Xiaotian Han, Yiren Jian, Xuefeng Hu, Haogeng Liu, Yiqi Wang, Qihang Fan, Yuang Ai, Huaibo Huang, Ran He, Zhenheng Yang, et al. Infimm-webmath-40b: Advancing multimodal pre-training for enhanced mathematical reasoning. arXiv preprint arXiv:2409.12568, 2024. 3, 4, 5, 9

  25. [26]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 3, 12

  26. [27]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. 3

  27. [28]

    Problem representation and mathematical problem solving of students of varying math ability

    Jennifer L Krawec. Problem representation and mathematical problem solving of students of varying math ability. Journal of Learning Disabilities, 47(2):103–115, 2014. 1

  28. [29]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 3843–3857, 2022. 3

  29. [30]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 3

  30. [31]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024. 3

  31. [32]

    Eagle: Elevating geometric reasoning through llm-empowered visual instruction tuning

    Zhihao Li, Yao Du, Yang Liu, Yan Zhang, Yufang Liu, Mengdi Zhang, and Xunliang Cai. Eagle: Elevating geometric reasoning through llm-empowered visual instruction tuning. arXiv preprint arXiv:2408.11397, 2024. 3

  32. [33]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. 3

  33. [34]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 1

  34. [35]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 7

  35. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 1, 3

  36. [37]

    Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark

    Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. arXiv preprint arXiv:2405.12209, 2024. 3

  37. [38]

    Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models

    Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, and Liang He. Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. arXiv preprint arXiv:2409.02834, 2024. 3

  38. [39]

    Finemath: A fine-grained mathematical evaluation benchmark for chinese large language models

    Yan Liu, Renren Jin, Ling Shi, Zheng Yao, and Deyi Xiong. Finemath: A fine-grained mathematical evaluation benchmark for chinese large language models. arXiv preprint arXiv:2403.07747, 2024. 3

  39. [40]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 1, 3

  40. [41]

    Visaidmath: Benchmarking visual-aided mathematical reasoning

    Jingkun Ma, Runzhe Zhan, Derek F Wong, Yang Li, Di Sun, Hou Pong Chan, and Lidia S Chao. Visaidmath: Benchmarking visual-aided mathematical reasoning. arXiv preprint arXiv:2410.22995, 2024. 3

  41. [42]

    Language Models are Few-Shot Learners

    Ben Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, S Agarwal, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1, 2020. 3

  42. [43]

    A Comprehensive Overview of Large Language Models

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435,

  43. [44]

    OpenAI. Chatgpt. https://chat.openai.com, 2023. 3

  44. [45]

    Introducing openai o1, 2023

    OpenAI. Introducing openai o1, 2023. 12

  45. [46]

    GPT-4V(ision) system card, 2023

    OpenAI. GPT-4V(ision) system card, 2023. 3, 4, 5, 6, 9

  46. [47]

    GPT-4o system card, 2024

    OpenAI. GPT-4o system card, 2024. 4, 9

  47. [48]

    Multimath: Bridging visual and mathematical reasoning for large language models

    Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, and Zhi Tang. Multimath: Bridging visual and mathematical reasoning for large language models. arXiv preprint arXiv:2409.00147, 2024. 3

  48. [49]

    How to solve it: A new aspect of mathematical method

    George Polya and George Pólya. How to solve it: A new aspect of mathematical method. Princeton university press,

  49. [50]

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 2, 3, 12

  50. [51]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1

  51. [52]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. arXiv preprint arXiv:2407.06581, 2024. 1, 2

  52. [53]

    Towards robust automated math problem solving: a survey of statistical and deep learning approaches

    Amrutesh Saraf, Pooja Kamat, Shilpa Gite, Satish Kumar, and Ketan Kotecha. Towards robust automated math problem solving: a survey of statistical and deep learning approaches. Evolutionary Intelligence, pages 1–38, 2024. 3

  53. [54]

    Can llms master math? investigating large language models on math stack exchange

    Ankit Satpute, Noah Gießing, André Greiner-Petter, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, and Bela Gipp. Can llms master math? investigating large language models on math stack exchange. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2316–2320, 2024. 3

  54. [55]

    Pólya, problem solving, and education

    Alan H Schoenfeld. Pólya, problem solving, and education. Mathematics magazine, 60(5):283–291, 1987. 1, 2

  55. [56]

    Survey of different large language model architectures: Trends, benchmarks, and challenges

    Minghao Shao, Abdul Basit, Ramesh Karri, and Muhammad Shafique. Survey of different large language model architectures: Trends, benchmarks, and challenges. IEEE Access,

  56. [57]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 3

  57. [58]

    Math-llava: Bootstrapping mathematical reasoning for multimodal large language models

    Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, and Roy Ka-Wei Lee. Math-llava: Bootstrapping mathematical reasoning for multimodal large language models. arXiv preprint arXiv:2406.17294,

  58. [59]

    Automatic prompt augmentation and selection with chain-of-thought from labeled data

    KaShun Shum, Shizhe Diao, and Tong Zhang. Automatic prompt augmentation and selection with chain-of-thought from labeled data. arXiv preprint arXiv:2302.12822, 2023. 3

  59. [60]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 4, 5, 6, 9

  60. [61]

    Qwen2.5-llm: Extending the boundary of llms

    Qwen Team. Qwen2.5-llm: Extending the boundary of llms,

  61. [62]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3

  62. [63]

    Examining the potential and pitfalls of chatgpt in science and engineering problem-solving

    Karen D Wang, Eric Burkholder, Carl Wieman, Shima Salehi, and Nick Haber. Examining the potential and pitfalls of chatgpt in science and engineering problem-solving. In Frontiers in Education, page 1330486. Frontiers Media SA,

  63. [64]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 4, 5, 6, 9

  64. [65]

    Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning

    Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024. 3

  65. [66]

    Generative ai for math: Part i–mathpile: A billion-token-scale pretraining corpus for math

    Zengzhi Wang, Rui Xia, and Pengfei Liu. Generative ai for math: Part i–mathpile: A billion-token-scale pretraining corpus for math. arXiv preprint arXiv:2312.17120, 2023. 4

  66. [67]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 3, 5, 7

  67. [68]

    Chain-of-though (cot) prompting strategies for medical error detection and correction

    Zhaolong Wu, Abul Hasan, Jinge Wu, Yunsoo Kim, Jason PY Cheung, Teng Zhang, and Honghan Wu. Chain-of-though (cot) prompting strategies for medical error detection and correction. arXiv preprint arXiv:2406.09103, 2024. 3

  68. [69]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024. 12

  69. [70]

    Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities

    Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao. Model merging in llms, mllms, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666, 2024. 3

  70. [71]

    Mathglm-vision: Solving mathematical problems with multi-modal large language model

    Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Yuxiao Dong, and Jie Tang. Mathglm-vision: Solving mathematical problems with multi-modal large language model. arXiv preprint arXiv:2409.13729, 2024. 3

  71. [72]

    Gpt (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions

    Gokul Yenduri, M Ramalingam, G Chemmalar Selvi, Y Supriya, Gautam Srivastava, Praveen Kumar Reddy Maddikunta, G Deepti Raj, Rutvij H Jhaveri, B Prabadevi, Weizheng Wang, et al. Gpt (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access, 2024. 3

  72. [73]

    Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark

    Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, Lei Bai, et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. Advances in Neural Information Processing Systems, 36, 2024. 3

  73. [74]

    MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

    Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023. 4

  74. [75]

    Mario eval: Evaluate your math llm with your math llm–a mathematical dataset evaluation toolkit

    Boning Zhang, Chengxi Li, and Kai Fan. Mario eval: Evaluate your math llm with your math llm–a mathematical dataset evaluation toolkit. arXiv preprint arXiv:2404.13925, 2024. 3

  75. [76]

    Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning

    Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, et al. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. arXiv preprint arXiv:2410.02884, 2024. 12

  76. [77]

    Mm-llms: Recent advances in multimodal large language models

    Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024. 3

  77. [78]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 3

  78. [79]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024. 1, 2, 3, 7, 4

  79. [80]

    Mavis: Mathematical visual instruction tuning

    Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Yichi Zhang, Ziyu Guo, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, et al. Mavis: Mathematical visual instruction tuning. arXiv preprint arXiv:2407.08739, 2024. 3, 5, 6

  80. [81]

    Is your model really a good math reasoner? evaluating mathematical reasoning with checklist

    Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F Wong, Xiaowei Huang, Qiufeng Wang, and Kaizhu Huang. Is your model really a good math reasoner? evaluating mathematical reasoning with checklist. arXiv preprint arXiv:2407.08733, 2024. 2

Showing first 80 references.