pith. sign in

arxiv: 2605.16385 · v1 · pith:K5DCLXFTnew · submitted 2026-05-11 · 💻 cs.CV · cs.AI· cs.CL

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Pith reviewed 2026-05-20 22:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords solid geometryneural-symbolic reasoningformal languagepredicate librarytheorem bankgeometric problem solvingmultimodal reasoningParse2Reason
0
0 comments X

The pith

Hilbert-Geo converts solid geometry text and diagrams into a predicate-based formal language and then applies theorems to produce verifiable step-by-step solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hilbert-Geo as a unified formal language framework for solid geometry that includes a predicate library and theorem bank. It defines a Parse2Reason pipeline that first translates natural-language statements and 3D diagrams into conditional description language, then performs relational inference and algebraic computation. This produces strictly correct and human-readable reasoning traces that current multimodal models cannot match on the same tasks. The method also applies to plane geometry and comes with new expert-annotated datasets that include formal-language annotations. A sympathetic reader would care because the approach directly targets the spatial and relational failures that appear when large models confront three-dimensional figures.

Core claim

Hilbert-Geo is the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, the Parse2Reason method first parses problem descriptions and diagrams into conditional description language (CDL) and then leverages the formal CDL together with the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes.

What carries the argument

Conditional Description Language (CDL), a formalized language of predicates that encodes geometric conditions from both natural text and visual diagrams so that symbolic reasoning can follow.

If this is right

  • The generated reasoning processes are strictly correct and human-readable because they rest on explicit theorem applications rather than statistical generation.
  • The same Parse2Reason pipeline reaches 80.2 percent accuracy on the plane-geometry dataset PlaneFGeo3k, showing that the framework is not limited to three-dimensional cases.
  • Performance reaches 77.3 percent on SolidFGeo2k and 84.1 percent on MathVerse-Solid, exceeding the results of Gemini-2.5-pro and GPT-5 on the same benchmarks.
  • The method supplies new expert-annotated datasets equipped with formal-language annotations, solutions, and answers that can support further development of geometric reasoners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the predicate representation proves complete, the same parsing-plus-theorem structure could be reused for other 3D spatial reasoning tasks such as mechanical assembly or architectural verification.
  • The explicit theorem bank opens the possibility of combining the system with existing automated theorem provers to handle longer proof chains without manual intervention.
  • Because the output is a sequence of formal steps, it becomes feasible to check the reasoning against an independent symbolic engine and thereby measure the rate of hidden errors.

Load-bearing premise

The conditional description language and predicate library can represent both natural-language statements and solid geometric diagrams without loss of information or ambiguity that would break later reasoning.

What would settle it

A collection of solid-geometry problems whose diagrams contain spatial relations that cannot be expressed uniquely in the predicate library, resulting in either failed parses or inconsistent reasoning chains.

Figures

Figures reproduced from arXiv: 2605.16385 by Bin Dong, Haoyu Cheng, Qiufeng Wang, Ruoran Xu.

Figure 1
Figure 1. Figure 1: MLLMs struggle with problems in the field of solid [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Error distribution in SolidFGeo2k of Gemini and GPT [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visual Perception Error for the Problem in Fig. 4 [PITH_FULL_IMAGE:figures/full_fig_p002_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overall process of parsing images and text into geometry condition description language (CDL) [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Reasoning using geometric condition description language (CDL) [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performances for Different Subjects (CSS, SMR, SSI, [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Reasoning performance of MLLMs under different num [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Fuzzy Jaccard Similarity Score between Predicted CDL [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 12
Figure 12. Figure 12: example of hallucination and Visual Perception Error [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 11
Figure 11. Figure 11: Error distribution of Hilbert-Geo (Gemini 2.5 pro and [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 15
Figure 15. Figure 15: Examples from Mathverse-solid [PITH_FULL_IMAGE:figures/full_fig_p014_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Data Samples from Mathverse-solid and SolidFGeo2k [PITH_FULL_IMAGE:figures/full_fig_p014_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Comparison of Datasets for Solid Geometry Problems [PITH_FULL_IMAGE:figures/full_fig_p014_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Proportion of different subjects in SolidFGeo2k, note [PITH_FULL_IMAGE:figures/full_fig_p015_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Examples from different subjects [PITH_FULL_IMAGE:figures/full_fig_p015_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Construction Parsing Performance: Jaccard similar [PITH_FULL_IMAGE:figures/full_fig_p018_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Condition Parsing Performance: Jaccard similarity [PITH_FULL_IMAGE:figures/full_fig_p019_21.png] view at source ↗
Figure 23
Figure 23. Figure 23: Examples from designed samples The complete prompt template is shown below: You are an expert in geometry, logic, and computer science. Your task is to precisely convert a geometry problem (with natural language and an image) into a JSON object following the provided JSON Schema. You must strictly follow the schema and output a complete JSON object. Rule 0: Predicate Compliance (MOST IMPORTANT) - All CDL … view at source ↗
Figure 25
Figure 25. Figure 25: Plane and Solid differ greatly in all aspects, even if the [PITH_FULL_IMAGE:figures/full_fig_p022_25.png] view at source ↗
Figure 24
Figure 24. Figure 24: Theorem Search Tree and a inference demonstration is [PITH_FULL_IMAGE:figures/full_fig_p022_24.png] view at source ↗
Figure 27
Figure 27. Figure 27: A theorem (as illustrated in Fig. 25) that falls into set [PITH_FULL_IMAGE:figures/full_fig_p022_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: one illustrative example for reasoning based on Hilbert-Geo [PITH_FULL_IMAGE:figures/full_fig_p023_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Reasoning performance of MLLMs under different [PITH_FULL_IMAGE:figures/full_fig_p024_29.png] view at source ↗
read the original abstract

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, however most of works focus on plane geometry while usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps of first parsing then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage those formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated dataset SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves the state-of-the-art (SOTA) performance 77.3% in SolidFGeo2k and 84.1% in MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves the SOTA accuracy 80.2% in PlaneFGeo3k, demonstrating the generality of the Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Hilbert-Geo, a unified formal language framework for solid geometry consisting of an extensive predicate library and dedicated theorem bank. It introduces the Parse2Reason pipeline that first converts natural-language problem statements and solid diagrams into Conditional Description Language (CDL) predicates, then performs relational inference and algebraic computation over the theorem bank to produce strictly correct, verifiable reasoning traces. The method is evaluated on two new expert-annotated datasets (SolidFGeo2k and PlaneFGeo3k) plus the MathVerse-Solid subset, reporting SOTA accuracies of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid, substantially above leading MLLMs such as Gemini-2.5-pro and GPT-5; the framework is also shown to generalize to plane geometry.

Significance. If the CDL predicate library can be shown to capture all required 3D spatial relations from 2D projections without loss or ambiguity, the neural-symbolic Parse2Reason approach would constitute a meaningful advance in verifiable multimodal geometric reasoning. The public release of the curated datasets with formal annotations and the code would provide a valuable resource for the community.

major comments (2)
  1. [Parsing step / CDL definition] The SOTA claims on SolidFGeo2k rest on the unverified assumption that the CDL predicates faithfully encode all necessary 3D relations (depth ordering, hidden-line intersections, coplanarity, occlusion) from 2D diagram projections. The manuscript provides no explicit validation, completeness proof, or failure-case analysis demonstrating that the parsed CDL inputs are sufficient for the theorem bank to derive correct solutions in all cases.
  2. [Experiments / Results] The evaluation reports aggregate accuracies but lacks an ablation study or error breakdown that isolates the contribution of the formal reasoning component versus the neural parsing component. Without this, it is difficult to confirm that the performance gains over pure MLLMs derive from the verifiable theorem-based reasoning rather than improved diagram parsing alone.
minor comments (2)
  1. [Abstract / Datasets] Clarify the exact size, selection criteria, and annotation protocol for the MathVerse-Solid subset so readers can assess the scope and difficulty of that evaluation.
  2. [Method] Add a short table or figure illustrating a concrete solid-geometry example with its CDL predicate list and the corresponding theorem-bank derivation to make the Parse2Reason pipeline more transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the acknowledgment of the potential significance of the CDL-based framework and Parse2Reason pipeline for verifiable solid geometry reasoning. Below we respond point-by-point to the major comments. We will revise the manuscript to strengthen the presentation of validation and experimental analysis.

read point-by-point responses
  1. Referee: The SOTA claims on SolidFGeo2k rest on the unverified assumption that the CDL predicates faithfully encode all necessary 3D relations (depth ordering, hidden-line intersections, coplanarity, occlusion) from 2D diagram projections. The manuscript provides no explicit validation, completeness proof, or failure-case analysis demonstrating that the parsed CDL inputs are sufficient for the theorem bank to derive correct solutions in all cases.

    Authors: We agree that an explicit validation section would improve clarity. The CDL predicate library (detailed in Section 3.1) was constructed in collaboration with geometry experts to cover depth ordering, occlusion, coplanarity, hidden-line intersections, and other 3D relations via predicates such as DepthOrder, OccludedBy, Coplanar, and IntersectHidden. The SolidFGeo2k dataset was annotated by experts using these predicates, and the theorem bank was verified to derive solutions from the resulting CDL representations. To directly address the concern, we will add a new subsection (e.g., 4.3) that includes (i) a coverage argument mapping common 3D spatial relations to CDL predicates, (ii) quantitative statistics on predicate usage in the dataset, and (iii) a failure-case analysis of parsing errors and their impact on downstream reasoning. This revision will be included in the next version. revision: yes

  2. Referee: The evaluation reports aggregate accuracies but lacks an ablation study or error breakdown that isolates the contribution of the formal reasoning component versus the neural parsing component. Without this, it is difficult to confirm that the performance gains over pure MLLMs derive from the verifiable theorem-based reasoning rather than improved diagram parsing alone.

    Authors: We concur that isolating the contributions would strengthen the experimental claims. The current results compare the full Parse2Reason pipeline against MLLMs that perform both parsing and reasoning in a single forward pass, thereby indirectly highlighting the benefit of the symbolic theorem bank. However, we acknowledge the value of a direct ablation. In the revised manuscript we will add an ablation study that (a) replaces the relational inference and algebraic computation steps with direct answer generation from the parsed CDL (to measure the isolated effect of the theorem bank), and (b) reports error breakdowns categorized by parsing failures versus reasoning failures. These additions will appear in Section 5 and the supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on held-out evaluation of an independent symbolic framework

full rationale

The paper defines a CDL predicate library and theorem bank as a formal representation layer, then applies Parse2Reason (parse diagrams/text to CDL, then relational/algebraic inference) to produce verifiable reasoning traces. Reported accuracies (77.3% SolidFGeo2k, 84.1% MathVerse-Solid, 80.2% PlaneFGeo3k) are measured on expert-annotated held-out test sets and compared against MLLM baselines; these numbers are not obtained by fitting parameters to the target metrics or by renaming any internal quantity as a prediction. No equation or step reduces the central result to its own inputs by construction, and the theorem bank is presented as an external, reusable resource rather than a self-citation chain that carries the performance claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on the assumption that solid geometry problems can be completely captured by a finite predicate library and that the theorem bank contains all necessary inference rules; no numerical free parameters are introduced.

axioms (1)
  • domain assumption Standard Euclidean solid geometry theorems and spatial relations can be expressed as a finite set of predicates and inference rules.
    The theorem bank and CDL are built on this premise to enable relational inference.
invented entities (1)
  • Conditional Description Language (CDL) no independent evidence
    purpose: Formalized language to represent geometric conditions from both text and diagrams.
    New predicate-based language introduced to bridge natural language, images, and symbolic reasoning.

pith-pipeline@v0.9.0 · 5878 in / 1372 out tokens · 54223 ms · 2026-05-20T22:48:47.645837+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank... CDL... to represent both problem description and solid diagrams

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

  1. [1]

    Claude 3.7 sonnet system card.https : //www.anthropic.com/claude- 3- 7- sonnet- system-card, 2025

    Anthropic. Claude 3.7 sonnet system card.https : //www.anthropic.com/claude- 3- 7- sonnet- system-card, 2025. System card for Claude 3.7 Sonnet

  2. [2]

    Arnon, George E

    Dennis S. Arnon, George E. Collins, and Scott McCallum. Cylindrical algebraic decomposition i: The basic algorithm. SIAM Journal on Computing, 13(4):865–877, 1984

  3. [3]

    Birkh ¨auser Basel, 2004

    Lucian B ˘adescu.Projective Geometry and Formal Geome- try. Birkh ¨auser Basel, 2004

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    Relational inductive biases, deep learning, and graph networks

    Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, et al. Relational inductive biases, deep learning, and graph net- works.arXiv preprint arXiv:1806.01261, 2018

  6. [6]

    Xing, and Liang Lin

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. GeoQA: A ge- ometric question answering benchmark towards multimodal numerical reasoning. InFindings of the Association for Com- putational Linguistics: ACL-IJCNLP 2021, pages 513–523, 2021

  7. [7]

    UniGeo: Unify- ing geometry logical reasoning via reformulating mathemat- ical expression

    Jiaqi Chen, Tong Li, Jinghui Qin, et al. UniGeo: Unify- ing geometry logical reasoning via reformulating mathemat- ical expression. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, 2022

  8. [8]

    Do NOT think that much for 2+3=? on the overthinking of long reasoning models

    Xingyu Chen, Jiahao Xu, Tian Liang, et al. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InProceedings of the 42nd International Confer- ence on Machine Learning, pages 9487–9499. PMLR, 2025

  9. [9]

    A coefficient of agreement for nominal scales

    Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

  10. [10]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

  11. [11]

    Gemini 2.5: Updates to our family of thinking mod- els.https://developers.googleblog.com/en/ gemini- 2- 5- thinking- model- updates/, 2025

    Google. Gemini 2.5: Updates to our family of thinking mod- els.https://developers.googleblog.com/en/ gemini- 2- 5- thinking- model- updates/, 2025. Introduces Gemini 2.5 Pro and Gemini 2.5 Flash updates

  12. [12]

    Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects

    Muhammad Usman Hadi, Qasem Al Tashi, Abbas Shah, et al. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. TechRxiv, 2024. Preprint, version 6

  13. [13]

    Formal verification meth- ods

    Osman Hasan and Sofiene Tahar. Formal verification meth- ods. InEncyclopedia of Information Science and Technol- ogy, Third Edition, pages 7162–7170. IGI Global Scientific Publishing, 2015

  14. [14]

    Springer Tokyo, 2014

    Takayuki Hibi, editor.Gr ¨obner Bases: Statistics and Soft- ware Systems. Springer Tokyo, 2014. Copyright 2013

  15. [15]

    Solving ge- ometry problems via feature learning and contrastive learn- ing of multimodal data.Computer Modeling in Engineering & Sciences, 136(2):1707–1728, 2023

    Pengpeng Jian, Fucheng Guo, Yanli Wang, et al. Solving ge- ometry problems via feature learning and contrastive learn- ing of multimodal data.Computer Modeling in Engineering & Sciences, 136(2):1707–1728, 2023

  16. [16]

    A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35(2):1–72, 2026

    Juyong Jiang, Fan Wang, Jiasi Shen, et al. A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35(2):1–72, 2026

  17. [17]

    Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding

    Chaoyu Li, Eun Woo Im, Pooyan Fazli, et al. Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13723–13733, 2025

  18. [18]

    A survey on deep learning for theorem proving

    Zhaoyu Li, Jialiang Sun, Logan Murphy, et al. A survey on deep learning for theorem proving. InProceedings of the First Conference on Language Modeling, 2024

  19. [19]

    Inter-GPS: In- terpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, et al. Inter-GPS: In- terpretable geometry problem solving with formal language and symbolic reasoning. InProceedings of the 59th An- nual Meeting of the Association for Computational Linguis- tics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6774– 6786, Online, ...

  20. [20]

    MathVista: Evalu- ating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, et al. MathVista: Evalu- ating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learn- ing Representations, 2024. Oral presentation

  21. [21]

    Llama 3.3 model cards and prompt formats, 2024

    Meta. Llama 3.3 model cards and prompt formats, 2024. Of- ficial Meta documentation for Llama 3.3, release date: De- cember 6, 2024

  22. [22]

    Autofor- malizing euclidean geometry

    Logan Murphy, Kaiyu Yang, Jialiang Sun, et al. Autofor- malizing euclidean geometry. InProceedings of the 41st In- ternational Conference on Machine Learning, pages 36847– 36893. PMLR, 2024

  23. [23]

    A com- prehensive overview of large language models.ACM Trans- actions on Intelligent Systems and Technology, 16(5):1–72, 2025

    Humza Naveed, Asad Ullah Khan, Shi Qiu, et al. A com- prehensive overview of large language models.ACM Trans- actions on Intelligent Systems and Technology, 16(5):1–72, 2025

  24. [24]

    A symbolic characters aware model for solving ge- ometry problems

    Maizhen Ning, Qiu-Feng Wang, Kaizhu Huang, and Xiaowei Huang. A symbolic characters aware model for solving ge- ometry problems. InProceedings of the 31st ACM Inter- national Conference on Multimedia (MM ’23), pages 7767– 7775, New York, NY , USA, 2023. ACM

  25. [25]

    GNS: Solving plane geometry problems by neural-symbolic reasoning with multi-modal llms

    Maizhen Ning, Zihao Zhou, Qiufeng Wang, Xiaowei Huang, and Kaizhu Huang. GNS: Solving plane geometry problems by neural-symbolic reasoning with multi-modal llms. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 24957–24965, 2025

  26. [26]

    Hello gpt-4o.https : / / openai

    OpenAI. Hello gpt-4o.https : / / openai . com / index/hello-gpt-4o/, 2024. OpenAI announcement

  27. [27]

    Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025

    OpenAI. Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025. OpenAI an- nouncement

  28. [28]

    GPT-5 system card.https://openai.com/ index/gpt-5-system-card/, 2025

    OpenAI. GPT-5 system card.https://openai.com/ index/gpt-5-system-card/, 2025. OpenAI system card

  29. [29]

    Pittalis and C

    M. Pittalis and C. Christou. Types of reasoning in 3d geom- etry thinking and their relation with spatial ability.Educa- tional Studies in Mathematics, 75(2):191–212, 2010

  30. [30]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. Gemini: A family of highly capable multi- modal models.arXiv preprint arXiv:2312.11805, 2023

  31. [31]

    Measuring multi- modal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, et al. Measuring multi- modal mathematical reasoning with math-vision dataset. In NeurIPS 2024 Datasets and Benchmarks Track, 2024

  32. [32]

    SolidGeo: Measuring multimodal spatial math reasoning in solid ge- ometry

    Peijie Wang, Chao Yang, Zhong-Zhi Li, et al. SolidGeo: Measuring multimodal spatial math reasoning in solid ge- ometry. InNeurIPS 2025 Datasets and Benchmarks Track,

  33. [33]

    Thoughts are all over the place: On the underthinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

    Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts are all over the place: On the under- thinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

  34. [34]

    A survey on large language models for recommendation.World Wide Web, 27(5):60, 2024

    Likang Wu, Zhi Zheng, Zhaopeng Qiu, et al. A survey on large language models for recommendation.World Wide Web, 27(5):60, 2024

  35. [35]

    NesyGeo: A neuro-symbolic framework for multimodal geometric reasoning data generation

    Weiming Wu, Jiachen Ye, Zihao Wang, Ziyi Zhou, Yifan Li, and Luzhen Guo. Nesygeo: A neuro-symbolic framework for multimodal geometric reasoning data generation.arXiv preprint arXiv:2505.17121, 2025

  36. [36]

    GeoX: Ge- ometric problem solving through unified formalized vision- language pre-training

    Renqiu Xia, Mingsheng Li, Hancheng Ye, et al. GeoX: Ge- ometric problem solving through unified formalized vision- language pre-training. InThe Thirteenth International Con- ference on Learning Representations, 2025

  37. [37]

    Math- Verse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Com- puter Vision, pages 169–186

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, et al. Math- Verse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Com- puter Vision, pages 169–186. Springer, 2024

  38. [38]

    FormalGeo: An extensible formalized framework for olympiad geometric problem solving.arXiv preprint arXiv:2310.18021, 2023

    Xiaokai Zhang, Na Zhu, Yiming He, et al. FormalGeo: An extensible formalized framework for olympiad geometric problem solving.arXiv preprint arXiv:2310.18021, 2023

  39. [39]

    Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information

    Junbo Zhao, Ting Zhang, Jiayu Sun, Mi Tian, and Hua Huang. Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1526–1536, 2025

  40. [40]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 2023. Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning Supplementary Material A. Solid Geometry Formal Language A.1. Formal Geometry Representation In the domain of solid geometry, simple geometric bodies serve as fundame...

  41. [41]

    LetR A andR B be face sets of two valid polyhedra

  42. [42]

    Execute the operationR result =R A ⊕3D RB

  43. [43]

    This operation removes the internal contact faces (S A andS B) and retains all external surfaces

  44. [44]

    According to the generalization of Euler’s formula for manifolds, when two closed manifolds are glued along a simply connected face and that face is removed, the remaining surface still constitutes a closed 2-manifold (i.e., the boundary of the new polyhedron)

  45. [45]

    ∀RA, RB ∈S,(R A ⊕3D RB)∈S(5) Theorem A.2.R A ⊕3D RB =R B ⊕3D RA

    Therefore,R result remains a set of faces describing a closed solid. ∀RA, RB ∈S,(R A ⊕3D RB)∈S(5) Theorem A.2.R A ⊕3D RB =R B ⊕3D RA. Proof.LetR A containmfaces andR B containnfaces. Based on Eq. 4: RA ⊕3D RB ={f|f∈R A ∪R B, f̸=S A, f̸=S B}(6) Now consider the reverse operationR B ⊕3D RA: RB ⊕3D RA = (R B \ {SB})∪(R A \ {SA})(7) According to set algebra, ...

  46. [46]

    PolyhedraAandBshare interface faces(S AB, SBA)

  47. [47]

    par- allel

    PolyhedraBandCshare interface faces(S BC , SCB ). 3.R A, RB, RC are their respective face sets. Left Hand Side: LetR AB =R A ⊕3D RB. RAB = (R A ∪R B)\ {S AB, SBA}(10) Next, calculateR AB ⊕3D RC. The contact interface in- volvesBandC(i.e.,S BC andS CB ): (RA ⊕3D RB)⊕ 3D RC = (R AB ∪R C)\ {S BC , SCB } (11) = ((R A ∪R B)\ {S AB, SBA} ∪R C)\ {S BC , SCB } (1...

  48. [48]

    - image_cdl MUST include only facts directly observable from the image (e.g., length labels, right-angle marks, shape recognition)

    Information Source Separation: - text_cdl MUST include only facts extracted from the natural language description. - image_cdl MUST include only facts directly observable from the image (e.g., length labels, right-angle marks, shape recognition). - If a fact appears in both text and image, include it in both fields

  49. [49]

    construction_cdl - Geometric construction predicates (IMPORTANT): construction_cdl defines basic construction for entities, and MUST include the following types where applicable: - Shape predicates: define edges/segments of shapes * For segments/edges: Shape(AB,BC,CD,DA) or Shape(OP,PO) or Shape(PQ,QP) * For points (spheres etc.): Shape(O) or Shape(P) * E...

  50. [50]

    10", "36 *pi

    Answer formatting: - problem_answer MUST be a pure number or expression (e.g., "10", "36 *pi"), and MUST NOT contain units or extra text

  51. [51]

    Core predicate logic: - Length/Height: Equal(LengthOfLine(A,B),5), Equal(HeightOfCone(O,P),12) - Relations: PerpendicularBetweenLine(A,B,C,D), ParallelBetweenLine(A,B,C,D) - Goal: the requested quantity MUST be wrapped by Value(...)

  52. [52]

    - Quantities allowed in CDL expressions are LIMITED to standard forms: VolumeOfCone, SurfaceAreaOfCylinder, AreaOfCircle, LengthOfLine, etc

    Predicate and Operator Legality (CRITICAL): - Only reuse names from the official predicate list; DO NOT invent new construction predicates. - Quantities allowed in CDL expressions are LIMITED to standard forms: VolumeOfCone, SurfaceAreaOfCylinder, AreaOfCircle, LengthOfLine, etc. - Only the following algebraic operators are allowed: Value, Add, Sub, Mul, ...

  53. [53]

    Important: Output Requirements

    Completeness Checks: - Ensure every entity used by text_cdl/image_cdl exists in construction_cdl - Ensure the target entity in goal_cdl exists in the construction as well - Self-check after generation: verify all predicates/operators are allowed, no extra spaces, and no undeclared entities are referenced. Important: Output Requirements

  54. [54]

    You MUST output a complete JSON object with all required fields

  55. [55]

    All CDL fields MUST be arrays of strings

  56. [56]

    Value(VolumeOfCone(O,P))

    goal_cdl MUST be a string (e.g., "Value(VolumeOfCone(O,P))") C.4.2. Direct Problem Solving Prompt In addition to CDL generation, the system also supportsdi- rect problem solvingusing GPT models fortesting model accuracy. This approach bypasses formalization and di- rectly generates answers to geometry problems, providing a baseline for comparison with for...

  57. [57]

    Carefully analyze the problem text and the accompanying image

  58. [58]

    Show your reasoning process step by step

  59. [59]

    At the end, provide your final answer in a clear format

  60. [60]

    10", "5.5

    **Your final answer should be ONLY a number or mathematical expression (like "10", "5.5", "12 *pi", "36 *pi"), without any units or text **

  61. [61]

    FINAL ANSWER:

    Put your final answer on a line starting with "FINAL ANSWER: " Example format: FINAL ANSWER: 10 or FINAL ANSWER: 36 *pi Now, please solve this problem: D. SGRE Supplementary Information D.1. Theorem Search Tree and Search Process Figure 24. Theorem Search Tree and a inference demonstration is shown in fig. 28 The search process involves constructing a sea...