GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Pith reviewed 2026-05-19 00:35 UTC · model grok-4.3
The pith
Of 23 leading multimodal models, 18 lose more than half their performance on long-step geometry problems, and weak auxiliary-line construction holds back their overall geometric reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoLaux shows that leading MLLMs perform significantly worse on long-step geometry problems than on short-step ones: 18 of 23 models drop by more than 50 percent. It also argues that better understanding and construction of auxiliary lines is essential for overall geometric reasoning, and it finds that limited hints improve process correctness whereas explicit answers lead models to skip steps.
What carries the argument
The GeoLaux dataset of 2186 problems, annotated for exact step count and auxiliary-line necessity, used to run a five-dimensional evaluation of MLLM performance.
If this is right
- Targeted improvements in long-chain reasoning are needed to reduce the observed performance gap between short and long problems.
- Better model awareness and skill in auxiliary line construction would raise overall success rates on geometry tasks.
- Limited answer hints can be used to encourage step-by-step correctness during evaluation and training.
Where Pith is reading between the lines
- Benchmarks that include problems with up to 24 steps could become standard for testing real-world geometric reasoning depth.
- The same long-step and auxiliary-line pattern may appear in other diagram-heavy domains such as physics or engineering.
- Training regimes that reward explicit intermediate steps rather than final answers alone could mitigate the skipping behavior observed with full hints.
Load-bearing premise
The 2186 problems were accurately labeled for step count and auxiliary-line needs, and the evaluation fairly measures genuine reasoning ability without large prompt or selection biases.
What would settle it
Independent re-annotation of a random subset of problems by geometry experts followed by re-running the same 23 models to check whether the reported performance drop on long-step items stays above 50 percent.
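A minimal sketch of the check described above, assuming per-problem results are available as (model, annotated step count, correct) records. The `Result` record, the `threshold` that separates short-step from long-step problems, and the reading of "performance drop" as a relative drop in accuracy are all assumptions made for illustration; the abstract does not define the split or the exact metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical record: one model's outcome on one problem."""
    model: str
    steps: int      # annotated step count of the problem
    correct: bool   # whether the model solved it


def relative_drop(results, threshold):
    """Per-model relative accuracy drop from short-step (steps <= threshold)
    to long-step (steps > threshold) problems."""
    tallies = {}
    for r in results:
        bucket = "long" if r.steps > threshold else "short"
        t = tallies.setdefault(r.model, {"short": [0, 0], "long": [0, 0]})
        t[bucket][0] += int(r.correct)   # number correct
        t[bucket][1] += 1                # number attempted
    drops = {}
    for model, t in tallies.items():
        (short_ok, short_n), (long_ok, long_n) = t["short"], t["long"]
        if short_n == 0 or long_n == 0 or short_ok == 0:
            continue  # the relative drop is undefined for this model
        short_acc = short_ok / short_n
        long_acc = long_ok / long_n
        drops[model] = (short_acc - long_acc) / short_acc
    return drops


def models_over(drops, cutoff):
    """Sorted names of models whose relative drop exceeds the cutoff."""
    return sorted(m for m, d in drops.items() if d > cutoff)


# Toy data, not taken from the paper.
toy = (
    [Result("A", 3, True)] * 4
    + [Result("A", 9, True)] + [Result("A", 9, False)] * 3
    + [Result("B", 2, True)] * 2 + [Result("B", 2, False)] * 2
    + [Result("B", 8, True)] * 2 + [Result("B", 8, False)] * 2
)
print(models_over(relative_drop(toy, threshold=6), cutoff=0.5))
```

Running the same computation on the re-annotated subset, and counting how many models land above the 0.5 cutoff, would show whether the 18-of-23 figure survives a change of annotators.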
Original abstract
Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs' geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GeoLaux, a dataset of 2186 geometry calculation and proof problems with fine-grained annotations for long-step reasoning (average 6.51 steps, maximum 24) and auxiliary line construction (required in 41.8% of problems). It evaluates 23 leading MLLMs across five dimensions and reports three findings: models perform significantly worse on long-step problems (with 18 models showing >50% performance drop), auxiliary line construction is critical for geometric reasoning, and limited answer hints improve process correctness while explicit answers cause neglect of intermediate steps.
Significance. If the step-count and auxiliary-line annotations can be independently validated, GeoLaux would provide a valuable fine-grained benchmark for diagnosing MLLM limitations in diagram comprehension, multi-step reasoning, and auxiliary construction. The reported performance gaps and hint effects offer concrete directions for improving geometric reasoning in multimodal models, and the public release of data and code supports reproducibility.
major comments (1)
- [§3] §3 (Dataset Construction and Annotation): The headline result that 18 of 23 models exhibit >50% performance drop on long-step versus short-step problems rests on the partition of the 2186 problems by annotated step count. The manuscript states that problems were 'fine-grained annotated' for step count and auxiliary-line requirements but supplies no annotation rubric, number of annotators, disagreement-resolution procedure, or inter-annotator agreement statistic. Without these details the observed length effect cannot be distinguished from possible annotation artifacts.
minor comments (2)
- [Abstract] The abstract refers to a 'five-dimensional evaluation' without enumerating the dimensions; listing them explicitly would improve immediate readability.
- Table or figure captions that report per-model accuracy on long-step and short-step subsets should include the exact number of problems in each subset to allow direct verification of the reported drops.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address the major comment on dataset annotation below and will incorporate the requested details in the revised manuscript.
- Referee: [§3] §3 (Dataset Construction and Annotation): The headline result that 18 of 23 models exhibit >50% performance drop on long-step versus short-step problems rests on the partition of the 2186 problems by annotated step count. The manuscript states that problems were 'fine-grained annotated' for step count and auxiliary-line requirements but supplies no annotation rubric, number of annotators, disagreement-resolution procedure, or inter-annotator agreement statistic. Without these details the observed length effect cannot be distinguished from possible annotation artifacts.
  Authors: We agree that the current manuscript lacks sufficient detail on the annotation protocol, which is necessary to substantiate the step-count partitions and the reported performance gaps. In the revision we will add a dedicated subsection describing the annotation rubric (including explicit criteria for counting reasoning steps and identifying auxiliary-line requirements), the number of annotators, the disagreement-resolution procedure, and the computed inter-annotator agreement statistics. These additions will allow readers to assess the reliability of the annotations independently and will directly address the concern that the length effect might reflect annotation artifacts.
  Revision: yes
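The inter-annotator agreement statistic requested here could be computed along the following lines. This is only a sketch: neither the manuscript nor the rebuttal says which statistic will be reported, so the choice of Cohen's kappa, the two-annotator setup, and the binary "auxiliary line required" label are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each
    annotator's own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[label] * freq_b[label]
        for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = auxiliary line required, 0 = not required.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))
```

Step counts are numeric rather than categorical, so an agreement measure that accounts for the size of a disagreement (for example a weighted kappa) may suit that label better than the unweighted form shown here.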
Circularity Check
No circularity: direct empirical evaluation on newly constructed benchmark
The paper introduces GeoLaux as a new dataset of 2186 geometry problems with fine-grained annotations for step count and auxiliary-line requirements, then reports direct performance measurements from running 23 MLLMs. The central findings (performance drop on long-step problems, importance of auxiliary lines) are observational results from this evaluation, not derived from any equations, fitted parameters, or self-referential definitions. No derivation chain exists that reduces predictions or uniqueness claims to the paper's own inputs by construction. The work is self-contained as an empirical benchmark study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Geometry problems can be meaningfully categorized by reasoning length and need for auxiliary line construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Process Correctness Score (PCS) ... requires both correct answers and error-free solution processes
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
  Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
- Causal Probing for Internal Visual Representations in Multimodal Large Language Models
  Activation steering reveals localized encoding for entities versus distributed encoding for abstract concepts in MLLMs, identifying depth as key for the latter and a perception-reasoning disconnect.
Appendix excerpts: error types and grading instructions
- Figure comprehension error: Failure to correctly understand the geometric primitives (points, lines, circles, etc.) implied by the diagram, such as misidentifying angle relationships, collinear relationships, etc.
- Knowledge Error: While correctly understanding the point/line relationships, the solution employs incorrect formulas. This includes: using wrong formulas/theorems/properties, or selecting inappropriate formulas/theorems/properties for the given problem.
- Calculation Error: While correctly understanding the geometric relationships and properly selecting/applying the relevant knowledge, the solution contains numerical calculation mistakes or unit conversion errors.
- Logical Reasoning Error: The reasoning process contains logical fallacies, including but not limited to: invalid causal relationships between premises and conclusions (the "because-therefore" connection is unjustified), AI making intuitive assumptions without basis, drawing conclusions by introducing irrelevant external information or incorrect ...
- Error Cause Analysis: For each step marked as incorrect (score=0), determine why it's wrong and provide a detailed explanation of the fundamental error.
- Error Type Classification: Based on your analysis, categorize each error into one of the following types: Figure Understanding Error, Knowledge Error, Calculation Error and Logical Reasoning Error. Please select from these error types and output the corresponding error category for each incorrect step in sequence. For steps without errors, output "N/A". T...
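The page quotes the Process Correctness Score (PCS) as requiring both a correct answer and an error-free solution process, and the grading instructions above mark incorrect steps with score=0. A rough sketch of how such a score might be aggregated is given below. The exact PCS formula is elided in the quoted passage, so the function names, the input shape, and the reading of PCS as a fraction of solutions are assumptions, not the authors' definition.

```python
def process_correct(answer_correct, step_scores):
    """True only if the final answer is right and no step is scored 0.

    step_scores holds one entry per solution step: 1 = correct, 0 = incorrect.
    An empty step list counts as error-free in this sketch.
    """
    return bool(answer_correct) and all(score == 1 for score in step_scores)


def pcs(solutions):
    """Assumed aggregate: share of solutions that are process-correct.

    solutions is a list of (answer_correct, step_scores) pairs.
    """
    if not solutions:
        raise ValueError("no solutions to score")
    return sum(process_correct(a, s) for a, s in solutions) / len(solutions)


def answer_accuracy(solutions):
    """Share of solutions with a correct final answer, ignoring the steps."""
    if not solutions:
        raise ValueError("no solutions to score")
    return sum(bool(a) for a, _ in solutions) / len(solutions)


# Hypothetical graded solutions, not data from the paper.
graded = [
    (True, [1, 1, 1]),   # right answer, clean process
    (True, [1, 0, 1]),   # right answer reached through a wrong step
    (False, [1, 1]),     # clean steps so far, wrong final answer
    (True, [1]),         # right answer, single correct step
]
print(answer_accuracy(graded), pcs(graded))
```

The second solution is the case that separates the two numbers: it counts toward answer accuracy but not toward the process-based score.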