arxiv: 2508.06226 · v4 · submitted 2025-08-08 · 💻 cs.AI

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Pith reviewed 2026-05-19 00:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal large language models · geometry problem solving · auxiliary lines · long-step reasoning · benchmark dataset · MLLM evaluation

The pith

Most leading multimodal models (18 of 23) lose more than half their performance on long-step geometry problems, and auxiliary-line construction remains a key weakness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoLaux, a dataset of 2186 geometry calculation and proof problems that average 6.51 solution steps and require auxiliary lines in 41.8 percent of cases. It evaluates 23 leading MLLMs across five dimensions and finds that 18 models suffer performance drops exceeding 50 percent on the longer problems relative to shorter ones. The results also indicate that auxiliary line construction remains a key weakness and that limited answer hints preserve intermediate reasoning better than full solutions do. These outcomes matter because they expose concrete limits in current models' ability to handle extended diagram-based reasoning chains.

Core claim

GeoLaux shows that leading MLLMs perform significantly worse on long-step geometry problems than on short-step ones, with 18 of 23 models exhibiting drops of more than 50 percent, while also establishing that stronger understanding and construction of auxiliary lines is essential for overall geometric reasoning and that limited hints improve process correctness whereas explicit answers lead models to skip steps.
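The "drop of more than 50 percent" criterion can be sketched as follows. This is a minimal illustration that assumes the drop is measured relative to the short-step score; the source does not state whether the drop is relative or absolute. The function names and the scores are hypothetical and are not results from the paper.

```python
def relative_drop(short_score: float, long_score: float) -> float:
    """Fractional drop from the short-step score to the long-step score.

    Assumption: the drop is relative to the short-step score.
    """
    if short_score <= 0:
        raise ValueError("short_score must be positive")
    return (short_score - long_score) / short_score


def count_large_drops(scores: dict[str, tuple[float, float]],
                      threshold: float = 0.5) -> int:
    """Count models whose relative drop strictly exceeds the threshold."""
    return sum(
        1
        for short_score, long_score in scores.values()
        if relative_drop(short_score, long_score) > threshold
    )


# Made-up (short-step, long-step) scores for illustration only.
example_scores = {
    "model_a": (60.0, 20.0),  # relative drop of 2/3
    "model_b": (50.0, 30.0),  # relative drop of 0.4
    "model_c": (40.0, 20.0),  # relative drop of exactly 0.5, not counted
}

print(count_large_drops(example_scores))
```

Using a strict inequality mirrors the wording "more than 50 percent"; a model sitting exactly at the threshold is not counted.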

What carries the argument

The GeoLaux dataset of 2186 problems, annotated for exact step count and auxiliary-line necessity, used to run a five-dimensional evaluation of MLLM performance.
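One way to picture a per-problem annotation is the record below. It is a sketch under stated assumptions: the field names, the `Problem` class, and the cut-off of 6 steps are hypothetical and are not taken from the released dataset schema. The cut-off is only chosen near the reported average of 6.51 steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Hypothetical record for one annotated problem."""
    problem_id: str
    kind: str                   # "calculation" or "proof"
    step_count: int             # annotated number of solution steps
    needs_auxiliary_line: bool  # annotated auxiliary-line requirement


def split_by_length(problems, long_threshold=6):
    """Partition problems into (short, long) by annotated step count.

    Assumption: a problem is "long" when step_count > long_threshold.
    """
    short = [p for p in problems if p.step_count <= long_threshold]
    long_ = [p for p in problems if p.step_count > long_threshold]
    return short, long_


def auxiliary_share(problems):
    """Fraction of problems that require an auxiliary line."""
    if not problems:
        return 0.0
    return sum(p.needs_auxiliary_line for p in problems) / len(problems)


# Toy records for illustration only.
toy = [
    Problem("p1", "calculation", 3, False),
    Problem("p2", "proof", 6, True),
    Problem("p3", "proof", 9, True),
]
short, long_ = split_by_length(toy)
print(len(short), len(long_), auxiliary_share(toy))
```

Reporting `len(short)` and `len(long_)` next to any per-subset accuracy makes the size of each partition explicit.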

If this is right

  • Targeted improvements in long-chain reasoning are needed to reduce the observed performance gap between short and long problems.
  • Better model awareness and skill in auxiliary line construction would raise overall success rates on geometry tasks.
  • Limited answer hints can be used to encourage step-by-step correctness during evaluation and training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that include problems with up to 24 steps could become standard for testing real-world geometric reasoning depth.
  • The same long-step and auxiliary-line pattern may appear in other diagram-heavy domains such as physics or engineering diagrams.
  • Training regimes that reward explicit intermediate steps rather than final answers alone could mitigate the skipping behavior observed with full hints.

Load-bearing premise

The 2186 problems were accurately labeled for step count and auxiliary-line needs, and the evaluation fairly measures genuine reasoning ability without large prompt or selection biases.

What would settle it

Independent re-annotation of a random subset of problems by geometry experts followed by re-running the same 23 models to check whether the reported performance drop on long-step items stays above 50 percent.
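Drawing the random subset reproducibly could look like the sketch below. The subset size of 200, the seed, and the use of integer positions as problem identifiers are assumptions made for illustration; only the total of 2186 problems comes from the source.

```python
import random


def sample_for_reannotation(problem_ids, k, seed=0):
    """Return a sorted, reproducible random subset of k problem identifiers."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(list(problem_ids), k))


# 2186 problems, identified here by position (an assumption for illustration).
subset = sample_for_reannotation(range(2186), k=200, seed=42)
print(len(subset))
```

Fixing the seed lets independent annotators and later re-runs work on the same subset.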

Figures

Figures reproduced from arXiv: 2508.06226 by Bo Zhao, Jiayin Zhu, Jun Liu, Lingling Zhang, Shaoxuan Ma, Wenjun Wu, Yumeng Fu, Yushun Zhang.

Figure 1: An illustration of an example from GeoLaux.
Figure 2: Problem quantity statistics across step lengths
Figure 4: Five-dimension evaluation framework of GeoLaux. Given golden answer and solution from dataset, evaluator assesses MLLM…
Figure 5: Illustration of auxiliary line evaluation inputs.
Figure 6: First error step variation as step length increases.
Figure 7: Comparison of calculation and proving problems.
Figure 8: PCS under different auxiliary line complexity.
Figure 9: Performance delta after prompting auxiliary lines.
Figure 10: Error types distribution for MLLMs. … simple connecting lines, but for problems requiring complex constructions (such as extending a line), it often resorts to brute-force methods such as coordinate systems. These methods increase solution complexity and computational load, serving as escape mechanisms when models lack sufficient spatial imagination and reasoning capabilities. Therefore, enhancing MLLMs' a…
Figure 11: Problem quantity statistics across step lengths in
Figure 12: Distribution of auxiliary line types in GeoLaux-mini.
Figure 13: Examples from the GeoLaux dataset
Figure 14: One-shot solution generation prompt for main evaluation.
Figure 15: One-shot solution generation prompt for auxiliary line heuristic evaluation.
Figure 16: Zero-shot Step-by-Step Evaluation prompt.
Figure 17: Zero-shot Error Type Evaluation prompt
Figure 18: Examples of process evaluation
Figure 19: Examples of different error types
Original abstract

Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs' geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces GeoLaux, a dataset of 2186 geometry calculation and proof problems with fine-grained annotations for long-step reasoning (average 6.51 steps, maximum 24) and auxiliary line construction (required in 41.8% of problems). It evaluates 23 leading MLLMs across five dimensions and reports three findings: models perform significantly worse on long-step problems (with 18 models showing >50% performance drop), auxiliary line construction is critical for geometric reasoning, and limited answer hints improve process correctness while explicit answers cause neglect of intermediate steps.

Significance. If the step-count and auxiliary-line annotations can be independently validated, GeoLaux would provide a valuable fine-grained benchmark for diagnosing MLLM limitations in diagram comprehension, multi-step reasoning, and auxiliary construction. The reported performance gaps and hint effects offer concrete directions for improving geometric reasoning in multimodal models, and the public release of data and code supports reproducibility.

major comments (1)
  1. §3 (Dataset Construction and Annotation): The headline result that 18 of 23 models exhibit >50% performance drop on long-step versus short-step problems rests on the partition of the 2186 problems by annotated step count. The manuscript states that problems were 'fine-grained annotated' for step count and auxiliary-line requirements but supplies no annotation rubric, number of annotators, disagreement-resolution procedure, or inter-annotator agreement statistic. Without these details the observed length effect cannot be distinguished from possible annotation artifacts.
minor comments (2)
  1. [Abstract] The abstract refers to a 'five-dimensional evaluation' without enumerating the dimensions; listing them explicitly would improve immediate readability.
  2. Table or figure captions that report per-model accuracy on long-step and short-step subsets should include the exact number of problems in each subset to allow direct verification of the reported drops.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the major comment on dataset annotation below and will incorporate the requested details in the revised manuscript.

Point-by-point responses
  1. Referee: §3 (Dataset Construction and Annotation): The headline result that 18 of 23 models exhibit >50% performance drop on long-step versus short-step problems rests on the partition of the 2186 problems by annotated step count. The manuscript states that problems were 'fine-grained annotated' for step count and auxiliary-line requirements but supplies no annotation rubric, number of annotators, disagreement-resolution procedure, or inter-annotator agreement statistic. Without these details the observed length effect cannot be distinguished from possible annotation artifacts.

    Authors: We agree that the current manuscript lacks sufficient detail on the annotation protocol, which is necessary to substantiate the step-count partitions and the reported performance gaps. In the revision we will add a dedicated subsection describing the annotation rubric (including explicit criteria for counting reasoning steps and identifying auxiliary-line requirements), the number of annotators, the disagreement-resolution procedure, and the computed inter-annotator agreement statistics. These additions will allow readers to assess the reliability of the annotations independently and will directly address the concern that the length effect might reflect annotation artifacts.

    Revision: yes
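The exchange does not name a specific inter-annotator agreement statistic. As one common choice, Cohen's kappa for two annotators can be sketched as below; treating it as the statistic of interest is an assumption, and the labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label marginals.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up "auxiliary line required?" labels from two hypothetical annotators.
annotator_1 = [True, True, False, False]
annotator_2 = [True, False, False, False]
print(cohens_kappa(annotator_1, annotator_2))
```

For step counts, which are ordinal rather than categorical, a weighted kappa or an intraclass correlation may be more suitable; that choice is left open here.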

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on newly constructed benchmark

full rationale

The paper introduces GeoLaux as a new dataset of 2186 geometry problems with fine-grained annotations for step count and auxiliary-line requirements, then reports direct performance measurements from running 23 MLLMs. The central findings (performance drop on long-step problems, importance of auxiliary lines) are observational results from this evaluation, not derived from any equations, fitted parameters, or self-referential definitions. No derivation chain exists that reduces predictions or uniqueness claims to the paper's own inputs by construction. The work is self-contained as an empirical benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions from AI benchmarking and geometry education. It introduces no free parameters or invented entities and rests on one domain assumption.

axioms (1)
  • domain assumption: Geometry problems can be meaningfully categorized by reasoning length and need for auxiliary line construction.
    This classification underpins the dataset statistics and the focus on auxiliary lines in 41.8% of problems.

pith-pipeline@v0.9.0 · 5795 in / 1292 out tokens · 59129 ms · 2026-05-19T00:35:51.987668+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

    cs.CV 2026-05 unverdicted novelty 7.0

    Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.

  2. Causal Probing for Internal Visual Representations in Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Activation steering reveals localized encoding for entities versus distributed encoding for abstract concepts in MLLMs, identifying depth as key for the latter and a perception-reasoning disconnect.

Error type definitions

  1. Figure comprehension error: Failure to correctly understand the geometric primitives (points, lines, circles, etc.) implied by the diagram, such as misidentifying angle relationships, collinear relationships, etc.

  2. Knowledge Error: While correctly understanding the point/line relationships, the solution employs incorrect formulas. This includes: using wrong formulas/theorems/properties, or selecting inappropriate formulas/theorems/properties for the given problem.

  3. Calculation Error: While correctly understanding the geometric relationships and properly selecting/applying the relevant knowledge, the solution contains numerical calculation mistakes or unit conversion errors.

  4. Logical Reasoning Error: The reasoning process contains logical fallacies, including but not limited to: invalid causal relationships between premises and conclusions (the "because-therefore" connection is unjustified), AI making intuitive assumptions without basis, drawing conclusions by introducing irrelevant external information or incorrect ...

Evaluation instructions

  • Error Cause Analysis: For each step marked as incorrect (score=0), determine why it's wrong and provide a detailed explanation of the fundamental error.

  • Error Type Classification: Based on your analysis, categorize each error into one of the following types: Figure Understanding Error, Knowledge Error, Calculation Error and Logical Reasoning Error. Please select from these error types and output the corresponding error category for each incorrect step in sequence. For steps without errors, output "N/A". T...
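The per-step outputs described above can be summarized with a small sketch. It assumes each solution is represented as a list of step scores (1 for correct, 0 for incorrect) and a parallel list of error labels, with "N/A" for correct steps. That representation and the function names are hypothetical; only the four category names and the "N/A" convention come from the prompt text.

```python
from collections import Counter

ERROR_TYPES = (
    "Figure Understanding Error",
    "Knowledge Error",
    "Calculation Error",
    "Logical Reasoning Error",
)


def first_error_step(step_scores):
    """Return the 1-based index of the first step scored 0, or None."""
    for index, score in enumerate(step_scores, start=1):
        if score == 0:
            return index
    return None


def tally_errors(step_labels):
    """Count error categories, skipping steps labelled "N/A"."""
    counts = Counter()
    for label in step_labels:
        if label == "N/A":
            continue
        if label not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {label!r}")
        counts[label] += 1
    return counts


# Made-up evaluator output for one five-step solution.
scores = [1, 1, 0, 1, 0]
labels = ["N/A", "N/A", "Knowledge Error", "N/A", "Calculation Error"]
print(first_error_step(scores), dict(tally_errors(labels)))
```

Rejecting unknown labels keeps malformed evaluator output from silently entering the error distribution.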