Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Bin Dong; Haoyu Cheng; Qiufeng Wang; Ruoran Xu

arXiv: 2605.16385 · v3 · pith:K5DCLXFTnew · submitted 2026-05-11 · 💻 cs.CV · cs.AI · cs.CL

Pith reviewed 2026-06-30 22:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords solid geometry · neural-symbolic reasoning · formal language · geometric problem solving · multimodal reasoning · predicate logic · theorem proving · 3D diagrams

The pith

A formal predicate language turns solid geometry diagrams and text into verifiable theorem-based solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hilbert-Geo as the first unified formal framework for solid geometry that includes an extensive predicate library and theorem bank. It proposes Parse2Reason, which first converts natural language descriptions and 3D diagrams into conditional description language (CDL) statements, then performs relational inference and algebraic computation using the theorem bank. This produces strictly correct, human-readable reasoning steps. The approach reaches 77.3 percent accuracy on the new SolidFGeo2k benchmark and 84.1 percent on MathVerse-Solid, beating leading multimodal models, and also works on plane geometry tasks.

Core claim

Hilbert-Geo supplies a unified formal language framework with predicates and theorems for solid geometry; the Parse2Reason method parses both problem text and diagrams into CDL, then applies the theorem bank for relational inference and algebraic computation to generate strictly correct and verifiable solutions.
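
The reasoning half of this claim can be pictured as forward chaining over a small theorem bank. The sketch below is a toy illustration only, not the released implementation: the predicate spellings, the two theorems, and the `solve` helper are all assumptions made for the example.

```python
import math

# Facts a parser might emit, keyed by CDL-style terms.
# All term spellings here are hypothetical, chosen for illustration.
facts = {
    "RadiusOfCone(O,P)": 3.0,
    "SlantHeightOfCone(O,P)": 5.0,
}

def cone_volume(known):
    """Hypothetical theorem: V = (1/3) * pi * r^2 * h."""
    r = known.get("RadiusOfCone(O,P)")
    h = known.get("HeightOfCone(O,P)")
    if r is None or h is None:
        return None
    return "VolumeOfCone(O,P)", math.pi * r * r * h / 3, "cone_volume_formula"

def cone_height(known):
    """Hypothetical theorem: h = sqrt(l^2 - r^2) for a right circular cone."""
    r = known.get("RadiusOfCone(O,P)")
    l = known.get("SlantHeightOfCone(O,P)")
    if r is None or l is None:
        return None
    return "HeightOfCone(O,P)", math.sqrt(l * l - r * r), "cone_height_from_slant"

# Toy theorem bank; the order forces two passes before the goal is reached.
THEOREM_BANK = [cone_volume, cone_height]

def solve(known, goal):
    """Apply theorems until the goal is derived or nothing new can be added."""
    known = dict(known)
    trace = []  # human-readable record of (theorem, derived term, value)
    changed = True
    while goal not in known and changed:
        changed = False
        for theorem in THEOREM_BANK:
            result = theorem(known)
            if result is None:
                continue
            term, value, name = result
            if term not in known:
                known[term] = value
                trace.append((name, term, value))
                changed = True
    return known.get(goal), trace

answer, trace = solve(facts, "VolumeOfCone(O,P)")
for name, term, value in trace:
    print(f"{name}: {term} = {value:.4f}")
```

Each derived value is tied to a named theorem, which is what makes the trace checkable step by step; a wrong input fact would still propagate, which is why the parsing step matters so much.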

What carries the argument

Conditional description language (CDL), a formalized predicate language for constructing geometric conditions that represents both natural text and visual diagrams to enable subsequent relational inference.
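
To make the idea of a predicate language concrete, here is a minimal sketch of turning one CDL-style statement into a nested term and listing the predicate names it uses. The statement syntax is assumed to be `Name(arg, ...)` with nested arguments; `parse_cdl` and `predicate_names` are hypothetical helpers, and the paper's actual grammar may differ.

```python
def parse_cdl(text):
    """Parse one CDL-style statement into nested (name, args) tuples.

    Atoms such as point labels or numbers are returned as plain strings.
    """
    pos = 0

    def term():
        nonlocal pos
        start = pos
        # Read a name or atom up to the next structural character.
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        name = text[start:pos].strip()
        if not name:
            raise ValueError(f"expected a name at position {start}")
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            args = [term()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                args.append(term())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"missing ')' at position {pos}")
            pos += 1  # consume ")"
            return (name, tuple(args))
        return name

    tree = term()
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return tree

def predicate_names(tree):
    """Collect every predicate or function name used in a parsed term."""
    if isinstance(tree, str):
        return set()
    name, args = tree
    names = {name}
    for arg in args:
        names |= predicate_names(arg)
    return names

tree = parse_cdl("Equal(LengthOfLine(A,B),5)")
print(tree)
print(sorted(predicate_names(tree)))
```

A parsed form like this makes it straightforward to reject statements that use names outside an approved predicate list before any reasoning is attempted.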

If this is right

The generated reasoning processes are strictly correct, verifiable, and human-readable.
The same framework applies directly to plane geometry problems.
Expert-annotated datasets with formal language annotations, solutions, and answers become available for advancing geometric reasoning research.
The method substantially outperforms pure multimodal large language models on solid geometry tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If CDL representations prove reliable across varied diagram styles, the same parse-then-reason structure could be tested on other spatial reasoning domains such as engineering drawings or molecular structures.
Expanding the theorem bank with additional solid geometry theorems would allow the system to handle a broader range of problem types without changing the core parsing step.
Integrating the formal CDL output as an intermediate representation might reduce hallucination rates in larger multimodal models when they are asked to explain 3D geometry solutions.

Load-bearing premise

The conditional description language can accurately and completely represent both natural language problem descriptions and solid diagrams without introducing errors or losing critical information.

What would settle it

A collection of solid geometry problems in which CDL parsing either introduces ambiguities in 3D spatial relations or omits necessary conditions, causing the downstream theorem-based reasoning to produce wrong answers.

Figures

Figures reproduced from arXiv: 2605.16385 by Bin Dong, Haoyu Cheng, Qiufeng Wang, Ruoran Xu.

**Figure 2.** Error distribution in SolidFGeo2k of Gemini and GPT

**Figure 3.** Visual Perception Error for the Problem in Fig. 4

**Figure 4.** Overall process of parsing images and text into geometry condition description language (CDL)

**Figure 5.** Reasoning using geometric condition description language (CDL)

**Figure 6.** Performances for Different Subjects (CSS, SMR, SSI, …

**Figure 7.** Reasoning performance of MLLMs under different num…

**Figure 8.** Fuzzy Jaccard Similarity Score between Predicted CDL …

**Figure 12.** Example of hallucination and Visual Perception Error

**Figure 11.** Error distribution of Hilbert-Geo (Gemini 2.5 pro and …

**Figure 15.** Examples from Mathverse-solid

**Figure 16.** Data Samples from Mathverse-solid and SolidFGeo2k

**Figure 17.** Comparison of Datasets for Solid Geometry Problems

**Figure 18.** Proportion of different subjects in SolidFGeo2k, note …

**Figure 19.** Examples from different subjects

**Figure 20.** Construction Parsing Performance: Jaccard similar…

**Figure 21.** Condition Parsing Performance: Jaccard similarity …

**Figure 23.** Examples from designed samples. The complete prompt template is shown below: You are an expert in geometry, logic, and computer science. Your task is to precisely convert a geometry problem (with natural language and an image) into a JSON object following the provided JSON Schema. You must strictly follow the schema and output a complete JSON object. Rule 0: Predicate Compliance (MOST IMPORTANT) - All CDL …

**Figure 25.** Plane and Solid differ greatly in all aspects, even if the …

**Figure 24.** Theorem Search Tree and an inference demonstration is …

**Figure 27.** A theorem (as illustrated in Fig. 25) that falls into set …

**Figure 28.** One illustrative example for reasoning based on Hilbert-Geo

**Figure 29.** Reasoning performance of MLLMs under different …

Original abstract

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, however most of works focus on plane geometry while usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps of first parsing then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage those formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated dataset SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves the state-of-the-art (SOTA) performance 77.3% in SolidFGeo2k and 84.1% in MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves the SOTA accuracy 80.2% in PlaneFGeo3k, demonstrating the generality of the Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hilbert-Geo adds a CDL predicate library and theorem bank for solid geometry with released datasets, but the SOTA numbers rest on an unverified parsing step that may not match the end-to-end MLLM baseline.

Hilbert-Geo introduces a formal language framework for solid geometry built around CDL predicates and a dedicated theorem bank, then runs a Parse2Reason pipeline that turns text and diagrams into CDL before doing symbolic inference. It reports 77.3% on the new SolidFGeo2k set and 84.1% on MathVerse-Solid, beating the listed MLLMs, and shows 80.2% on the plane-geometry set as well.

The concrete advance is the extension of formal methods to 3D geometry with a predicate library and theorem bank that the authors claim is the first unified one, plus the two expert-annotated datasets they release with full formal annotations, solutions, and answers. Releasing code and data is useful, and the output reasoning steps are human-readable and verifiable, which is a practical plus.

The soft spot is exactly the parsing step. The abstract gives no numbers on how well the neural parser recovers ground-truth CDL from raw text and images on held-out problems, and no ablation that runs the full pipeline with parser output instead of oracle CDL. If the reported scores use clean expert CDL rather than automatic parses, the comparison to Gemini and GPT-5 on raw inputs does not hold. The theorem bank construction details are also missing here. These are fixable but load-bearing for the central claim.

The paper is aimed at people working on hybrid neural-symbolic systems for geometry and multimodal math. It has enough new resources and a clear pipeline that it deserves a serious referee to check the implementation and evaluation details.

Referee Report

3 major / 2 minor

Summary. The paper introduces Hilbert-Geo, a formal language framework for solid geometry that includes a predicate library, Conditional Description Language (CDL), and a dedicated theorem bank. It proposes the Parse2Reason method, which first parses natural-language problem statements and solid diagrams into CDL and then performs relational inference plus algebraic computation via the theorem bank to produce verifiable reasoning traces. New expert-annotated datasets SolidFGeo2k and PlaneFGeo3k are released; the method reports 77.3% accuracy on SolidFGeo2k, 84.1% on MathVerse-Solid, and 80.2% on PlaneFGeo3k, outperforming several MLLMs.

Significance. If the neural parser reliably produces accurate, lossless CDL from raw text+image inputs on held-out problems, the work would demonstrate a viable neural-symbolic route to 3D geometric reasoning that yields human-readable, machine-verifiable proofs—an advance over end-to-end neural baselines that currently lack such guarantees.

major comments (3)

§4 (Parse2Reason pipeline): No quantitative evaluation of the neural parser is reported (e.g., exact-match rate or edit distance between automatically generated CDL and expert ground-truth annotations on the test splits of SolidFGeo2k). Without this metric or an ablation that substitutes the parser output for oracle CDL, the 77.3% and 84.1% figures cannot be attributed to the full automatic pipeline.
§5.2 (experimental comparison): The direct SOTA claim against Gemini-2.5-pro (54.2%) and GPT-5 (62.9%) assumes identical input conditions. Because the test sets are supplied with expert CDL annotations, it is unclear whether the reported numbers use parsed CDL or oracle CDL; if the latter, the comparison to end-to-end MLLMs that must parse raw inputs is invalid.
§3.2 (theorem bank): The bank is presented as central to the reasoning step, yet no coverage statistics, verification procedure, or failure cases on solid-geometry problems are supplied. This leaves open whether the reported accuracies rest on a complete or curated subset of theorems.

minor comments (2)

Figure 4 (pipeline diagram): The flow from diagram to CDL predicates is shown schematically but lacks a concrete side-by-side example of an input diagram, its parsed CDL, and the subsequent inference steps.
§2 (related work): The discussion of prior plane-geometry neuro-symbolic systems is adequate, but the text does not explicitly contrast the new CDL predicates required for 3D relations (e.g., occlusion, depth ordering) with existing 2D formalisms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important gaps in evaluation and documentation that we will address through revisions to strengthen the paper's claims about the full automatic pipeline.

Referee: §4 (Parse2Reason pipeline): no quantitative evaluation of the neural parser is reported (e.g., exact-match rate or edit distance between automatically generated CDL and expert ground-truth annotations on the test splits of SolidFGeo2k). Without this metric or an ablation that substitutes the parser output for oracle CDL, the 77.3% and 84.1% figures cannot be attributed to the full automatic pipeline.

Authors: We acknowledge the validity of this observation. The submitted manuscript reports end-to-end results but omits separate parser metrics and ablations. In the revised version we will add exact-match rates, edit distances, and other parser accuracy metrics on the SolidFGeo2k test split, together with an oracle-CDL ablation that isolates the parser's contribution. These additions will allow the reported accuracies to be properly attributed to the automatic pipeline.

Revision: yes

Referee: §5.2 (experimental comparison): the direct SOTA claim against Gemini-2.5-pro (54.2%) and GPT-5 (62.9%) assumes identical input conditions. Because the test sets are supplied with expert CDL annotations, it is unclear whether the reported numbers use parsed CDL or oracle CDL; if the latter, the comparison to end-to-end MLLMs that must parse raw inputs is invalid.

Authors: The 77.3% and 84.1% figures were obtained with the automatic Parse2Reason pipeline that ingests raw text and images and produces CDL via the trained neural parser; oracle CDL is used only for parser training and for the planned ablation. We will add an explicit statement in §5.2 clarifying the input conditions and confirming that all reported numbers reflect parsed, not oracle, CDL, thereby preserving the validity of the comparison with end-to-end MLLMs.

Revision: yes

Referee: §3.2 (theorem bank): the bank is presented as central to the reasoning step, yet no coverage statistics, verification procedure, or failure cases on solid-geometry problems are supplied. This leaves open whether the reported accuracies rest on a complete or curated subset of theorems.

Authors: We agree that additional documentation is required. The revised manuscript will include coverage statistics (fraction of SolidFGeo2k problems solvable by the bank), a description of the verification procedure (expert review plus automated consistency checks), and a summary of observed failure cases or coverage gaps for solid-geometry problems. These details will clarify the scope and completeness of the theorem bank.

Revision: yes
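
The parser metrics discussed in this exchange (exact match, set overlap, edit distance) are simple to compute once predicted and expert CDL are available as lists of strings. The sketch below is one plausible reading, not the paper's evaluation code: it uses plain Jaccard similarity rather than the "fuzzy" variant named in the Figure 8 caption, whose definition is not given here, and the example statements and helper names are assumptions.

```python
def normalize(statement):
    """Drop all whitespace so formatting differences do not count as errors."""
    return "".join(statement.split())

def exact_match(pred, gold):
    """True only if both statement sets are identical after normalization."""
    return {normalize(s) for s in pred} == {normalize(s) for s in gold}

def jaccard(pred, gold):
    """Plain Jaccard similarity |P & G| / |P | G| over normalized statements."""
    p = {normalize(s) for s in pred}
    g = {normalize(s) for s in gold}
    if not p and not g:
        return 1.0  # two empty parses agree trivially
    return len(p & g) / len(p | g)

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical parser output versus expert annotation for one problem.
pred = ["Equal(LengthOfLine(A,B),5)", "PerpendicularBetweenLine(A,B,C,D)"]
gold = ["Equal(LengthOfLine(A,B), 5)", "ParallelBetweenLine(A,B,C,D)"]

print("exact match:", exact_match(pred, gold))
print("jaccard:", round(jaccard(pred, gold), 3))
print("edit distance:", edit_distance(normalize(pred[1]), normalize(gold[1])))
```

In this toy case one relation is misread, so exact match fails while Jaccard still credits the correctly parsed length condition. Reporting both would show how often a parse is nearly right yet still feeds a wrong relation to the reasoner.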

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a new formal framework (CDL predicates and theorem bank), curates expert-annotated datasets (SolidFGeo2k, PlaneFGeo3k), and reports empirical accuracies from applying the Parse2Reason pipeline (neural parsing followed by symbolic inference) to held-out test portions. No equations, self-citations, or parameter-fitting steps are described that reduce the reported SOTA numbers (77.3%, 84.1%, 80.2%) to the inputs by construction; the results remain externally falsifiable measurements on the released datasets and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces new formal components specific to solid geometry.

axioms (1)

standard math: Standard rules of logical inference and algebraic computation hold for geometric predicates.
Used in the reasoning step to generate solutions.

invented entities (2)

Conditional Description Language (CDL): no independent evidence
purpose: To represent geometric conditions from text and images in a formal way.
New language introduced in the paper for the parsing step.

Theorem bank: no independent evidence
purpose: To provide theorems for relational inference and algebraic computation in solid geometry.
Dedicated bank created for the framework.

pith-pipeline@v0.9.1-grok · 5892 in / 1217 out tokens · 29748 ms · 2026-06-30T22:42:34.360822+00:00 · methodology

Reference graph

Works this paper leans on

61 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Claude 3.7 sonnet system card.https : //www.anthropic.com/claude- 3- 7- sonnet- system-card, 2025

Anthropic. Claude 3.7 sonnet system card.https : //www.anthropic.com/claude- 3- 7- sonnet- system-card, 2025. System card for Claude 3.7 Sonnet

2025
[2]

Arnon, George E

Dennis S. Arnon, George E. Collins, and Scott McCallum. Cylindrical algebraic decomposition i: The basic algorithm. SIAM Journal on Computing, 13(4):865–877, 1984

1984
[3]

Birkh ¨auser Basel, 2004

Lucian B ˘adescu.Projective Geometry and Formal Geome- try. Birkh ¨auser Basel, 2004

2004
[4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Relational inductive biases, deep learning, and graph networks

Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, et al. Relational inductive biases, deep learning, and graph net- works.arXiv preprint arXiv:1806.01261, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Xing, and Liang Lin

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. GeoQA: A ge- ometric question answering benchmark towards multimodal numerical reasoning. InFindings of the Association for Com- putational Linguistics: ACL-IJCNLP 2021, pages 513–523, 2021

2021
[7]

UniGeo: Unify- ing geometry logical reasoning via reformulating mathemat- ical expression

Jiaqi Chen, Tong Li, Jinghui Qin, et al. UniGeo: Unify- ing geometry logical reasoning via reformulating mathemat- ical expression. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, 2022

2022
[8]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, et al. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InProceedings of the 42nd International Confer- ence on Machine Learning, pages 9487–9499. PMLR, 2025

2025
[9]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

1960
[10]

DeepSeek-V3 Technical Report

DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Gemini 2.5: Updates to our family of thinking mod- els.https://developers.googleblog.com/en/ gemini- 2- 5- thinking- model- updates/, 2025

Google. Gemini 2.5: Updates to our family of thinking mod- els.https://developers.googleblog.com/en/ gemini- 2- 5- thinking- model- updates/, 2025. Introduces Gemini 2.5 Pro and Gemini 2.5 Flash updates

2025
[12]

Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects

Muhammad Usman Hadi, Qasem Al Tashi, Abbas Shah, et al. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. TechRxiv, 2024. Preprint, version 6

2024
[13]

Formal verification meth- ods

Osman Hasan and Sofiene Tahar. Formal verification meth- ods. InEncyclopedia of Information Science and Technol- ogy, Third Edition, pages 7162–7170. IGI Global Scientific Publishing, 2015

2015
[14]

Springer Tokyo, 2014

Takayuki Hibi, editor.Gr ¨obner Bases: Statistics and Soft- ware Systems. Springer Tokyo, 2014. Copyright 2013

2014
[15]

Solving ge- ometry problems via feature learning and contrastive learn- ing of multimodal data.Computer Modeling in Engineering & Sciences, 136(2):1707–1728, 2023

Pengpeng Jian, Fucheng Guo, Yanli Wang, et al. Solving ge- ometry problems via feature learning and contrastive learn- ing of multimodal data.Computer Modeling in Engineering & Sciences, 136(2):1707–1728, 2023

2023
[16]

A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35(2):1–72, 2026

Juyong Jiang, Fan Wang, Jiasi Shen, et al. A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 35(2):1–72, 2026

2026
[17]

Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding

Chaoyu Li, Eun Woo Im, Pooyan Fazli, et al. Vidhalluc: Evaluating temporal hallucinations in multimodal large lan- guage models for video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13723–13733, 2025

2025
[18]

A survey on deep learning for theorem proving

Zhaoyu Li, Jialiang Sun, Logan Murphy, et al. A survey on deep learning for theorem proving. InProceedings of the First Conference on Language Modeling, 2024

2024
[19]

Inter-GPS: In- terpretable geometry problem solving with formal language and symbolic reasoning

Pan Lu, Ran Gong, Shibiao Jiang, et al. Inter-GPS: In- terpretable geometry problem solving with formal language and symbolic reasoning. InProceedings of the 59th An- nual Meeting of the Association for Computational Linguis- tics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6774– 6786, Online, ...

2021
[20]

MathVista: Evalu- ating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, et al. MathVista: Evalu- ating mathematical reasoning of foundation models in visual contexts. InThe Twelfth International Conference on Learn- ing Representations, 2024. Oral presentation

2024
[21]

Llama 3.3 model cards and prompt formats, 2024

Meta. Llama 3.3 model cards and prompt formats, 2024. Of- ficial Meta documentation for Llama 3.3, release date: De- cember 6, 2024

2024
[22]

Autofor- malizing euclidean geometry

Logan Murphy, Kaiyu Yang, Jialiang Sun, et al. Autofor- malizing euclidean geometry. InProceedings of the 41st In- ternational Conference on Machine Learning, pages 36847– 36893. PMLR, 2024

2024
[23]

A com- prehensive overview of large language models.ACM Trans- actions on Intelligent Systems and Technology, 16(5):1–72, 2025

Humza Naveed, Asad Ullah Khan, Shi Qiu, et al. A com- prehensive overview of large language models.ACM Trans- actions on Intelligent Systems and Technology, 16(5):1–72, 2025

2025
[24]

A symbolic characters aware model for solving ge- ometry problems

Maizhen Ning, Qiu-Feng Wang, Kaizhu Huang, and Xiaowei Huang. A symbolic characters aware model for solving ge- ometry problems. InProceedings of the 31st ACM Inter- national Conference on Multimedia (MM ’23), pages 7767– 7775, New York, NY , USA, 2023. ACM

2023
[25]

GNS: Solving plane geometry problems by neural-symbolic reasoning with multi-modal llms

Maizhen Ning, Zihao Zhou, Qiufeng Wang, Xiaowei Huang, and Kaizhu Huang. GNS: Solving plane geometry problems by neural-symbolic reasoning with multi-modal llms. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 24957–24965, 2025

2025
[26]

Hello gpt-4o.https : / / openai

OpenAI. Hello gpt-4o.https : / / openai . com / index/hello-gpt-4o/, 2024. OpenAI announcement

2024
[27]

Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025

OpenAI. Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025. OpenAI an- nouncement

2025
[28]

GPT-5 system card.https://openai.com/ index/gpt-5-system-card/, 2025

OpenAI. GPT-5 system card.https://openai.com/ index/gpt-5-system-card/, 2025. OpenAI system card

2025
[29]

Pittalis and C

M. Pittalis and C. Christou. Types of reasoning in 3d geom- etry thinking and their relation with spatial ability.Educa- tional Studies in Mathematics, 75(2):191–212, 2010

2010
[30]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. Gemini: A family of highly capable multi- modal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Measuring multi- modal mathematical reasoning with math-vision dataset

Ke Wang, Junting Pan, Weikang Shi, et al. Measuring multi- modal mathematical reasoning with math-vision dataset. In NeurIPS 2024 Datasets and Benchmarks Track, 2024

2024
[32]

SolidGeo: Measuring multimodal spatial math reasoning in solid ge- ometry

Peijie Wang, Chao Yang, Zhong-Zhi Li, et al. SolidGeo: Measuring multimodal spatial math reasoning in solid ge- ometry. InNeurIPS 2025 Datasets and Benchmarks Track,

2025
[33]

Thoughts are all over the place: On the underthinking of o1-like llms

Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts are all over the place: On the under- thinking of o1-like llms.arXiv preprint arXiv:2501.18585, 2025

work page arXiv 2025
[34]

A survey on large language models for recommendation.World Wide Web, 27(5):60, 2024

Likang Wu, Zhi Zheng, Zhaopeng Qiu, et al. A survey on large language models for recommendation.World Wide Web, 27(5):60, 2024

2024
[35]

Nesygeo: A neuro-symbolic framework for multimodal geometric reasoning data generation.arXiv preprint arXiv:2505.17121, 2025

Weiming Wu, Jiachen Ye, Zihao Wang, Ziyi Zhou, Yifan Li, and Luzhen Guo. Nesygeo: A neuro-symbolic framework for multimodal geometric reasoning data generation.arXiv preprint arXiv:2505.17121, 2025

work page arXiv 2025
[36]

GeoX: Ge- ometric problem solving through unified formalized vision- language pre-training

Renqiu Xia, Mingsheng Li, Hancheng Ye, et al. GeoX: Ge- ometric problem solving through unified formalized vision- language pre-training. InThe Thirteenth International Con- ference on Learning Representations, 2025

2025
[37]

Math- Verse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Com- puter Vision, pages 169–186

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, et al. Math- Verse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Com- puter Vision, pages 169–186. Springer, 2024

2024
[38]

FormalGeo: An extensible formalized framework for olympiad geometric problem solving.arXiv preprint arXiv:2310.18021, 2023

Xiaokai Zhang, Na Zhu, Yiming He, et al. FormalGeo: An extensible formalized framework for olympiad geometric problem solving.arXiv preprint arXiv:2310.18021, 2023

work page arXiv 2023
[39]

Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information

Junbo Zhao, Ting Zhang, Jiayu Sun, Mi Tian, and Hua Huang. Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1526–1536, 2025

2025
[40]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 2023. Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning Supplementary Material A. Solid Geometry Formal Language A.1. Formal Geometry Representation In the domain of solid geometry, simple geometric bodies serve as fundame...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

LetR A andR B be face sets of two valid polyhedra
[42]

Execute the operationR result =R A ⊕3D RB
[43]

This operation removes the internal contact faces (S A andS B) and retains all external surfaces
[44]

According to the generalization of Euler’s formula for manifolds, when two closed manifolds are glued along a simply connected face and that face is removed, the remaining surface still constitutes a closed 2-manifold (i.e., the boundary of the new polyhedron)
[45]

∀RA, RB ∈S,(R A ⊕3D RB)∈S(5) Theorem A.2.R A ⊕3D RB =R B ⊕3D RA

Therefore,R result remains a set of faces describing a closed solid. ∀RA, RB ∈S,(R A ⊕3D RB)∈S(5) Theorem A.2.R A ⊕3D RB =R B ⊕3D RA. Proof.LetR A containmfaces andR B containnfaces. Based on Eq. 4: RA ⊕3D RB ={f|f∈R A ∪R B, f̸=S A, f̸=S B}(6) Now consider the reverse operationR B ⊕3D RA: RB ⊕3D RA = (R B \ {SB})∪(R A \ {SA})(7) According to set algebra, ...
[46]

PolyhedraAandBshare interface faces(S AB, SBA)
[47]

par- allel

PolyhedraBandCshare interface faces(S BC , SCB ). 3.R A, RB, RC are their respective face sets. Left Hand Side: LetR AB =R A ⊕3D RB. RAB = (R A ∪R B)\ {S AB, SBA}(10) Next, calculateR AB ⊕3D RC. The contact interface in- volvesBandC(i.e.,S BC andS CB ): (RA ⊕3D RB)⊕ 3D RC = (R AB ∪R C)\ {S BC , SCB } (11) = ((R A ∪R B)\ {S AB, SBA} ∪R C)\ {S BC , SCB } (1...
[48]

- image_cdl MUST include only facts directly observable from the image (e.g., length labels, right-angle marks, shape recognition)

Information Source Separation: - text_cdl MUST include only facts extracted from the natural language description. - image_cdl MUST include only facts directly observable from the image (e.g., length labels, right-angle marks, shape recognition). - If a fact appears in both text and image, include it in both fields
[49]

construction_cdl - Geometric construction predicates (IMPORTANT): construction_cdl defines basic construction for entities, and MUST include the following types where applicable: - Shape predicates: define edges/segments of shapes * For segments/edges: Shape(AB,BC,CD,DA) or Shape(OP,PO) or Shape(PQ,QP) * For points (spheres etc.): Shape(O) or Shape(P) * E...
[50]

10", "36 *pi

Answer formatting: - problem_answer MUST be a pure number or expression (e.g., "10", "36 *pi"), and MUST NOT contain units or extra text
[51]

Core predicate logic: - Length/Height: Equal(LengthOfLine(A,B),5), Equal(HeightOfCone(O,P),12) - Relations: PerpendicularBetweenLine(A,B,C,D), ParallelBetweenLine(A,B,C,D) - Goal: the requested quantity MUST be wrapped by Value(...)
[52]

Predicate and Operator Legality (CRITICAL):
- Only reuse names from the official predicate list; DO NOT invent new construction predicates.
- Quantities allowed in CDL expressions are LIMITED to standard forms: VolumeOfCone, SurfaceAreaOfCylinder, AreaOfCircle, LengthOfLine, etc.
- Only the following algebraic operators are allowed: Value, Add, Sub, Mul, ...
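A legality rule of this kind can be enforced mechanically after generation. The following sketch is hypothetical: `KNOWN_NAMES` contains only the names that happen to appear in the prompt excerpts here, whereas the official predicate and operator lists are longer and are elided in the source, so a real check would load the full lists instead. The function names are also assumptions made for illustration.

```python
import re

# Partial allow-list assembled only from names visible in the prompt text.
# The official lists are longer; replace this set with them in practice.
KNOWN_NAMES = {
    "Value", "Add", "Sub", "Mul", "Equal", "Shape",
    "LengthOfLine", "HeightOfCone", "VolumeOfCone",
    "SurfaceAreaOfCylinder", "AreaOfCircle",
    "PerpendicularBetweenLine", "ParallelBetweenLine",
}

def unknown_names(statement: str, allowed=KNOWN_NAMES) -> list:
    """Return every name used as a predicate/operator head that is not allowed."""
    heads = re.findall(r"([A-Za-z_][A-Za-z0-9_]*)\(", statement)
    return [h for h in heads if h not in allowed]

def has_extra_spaces(statement: str) -> bool:
    """The prompt asks for statements without extra spaces."""
    return " " in statement

print(unknown_names("Equal(VolumeOfCone(O,P),Mul(12,pi))"))
print(unknown_names("Value(MyVolume(O,P))"))
```

Here `MyVolume` is a deliberately invented name, so it is reported; arguments such as `O`, `P` or `pi` are not followed by a parenthesis and are therefore not treated as predicate heads.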
[53]

Completeness Checks:
- Ensure every entity used by text_cdl/image_cdl exists in construction_cdl
- Ensure the target entity in goal_cdl exists in the construction as well
- Self-check after generation: verify all predicates/operators are allowed, no extra spaces, and no undeclared entities are referenced.

Important: Output Requirements
[54]

You MUST output a complete JSON object with all required fields
[55]

All CDL fields MUST be arrays of strings
[56]

goal_cdl MUST be a string (e.g., "Value(VolumeOfCone(O,P))")

C.4.2. Direct Problem Solving Prompt

In addition to CDL generation, the system also supports direct problem solving using GPT models for testing model accuracy. This approach bypasses formalization and directly generates answers to geometry problems, providing a baseline for comparison with for...
[57]

Carefully analyze the problem text and the accompanying image
[58]

Show your reasoning process step by step
[59]

At the end, provide your final answer in a clear format
[60]

**Your final answer should be ONLY a number or mathematical expression (like "10", "5.5", "12*pi", "36*pi"), without any units or text**
[61]

Put your final answer on a line starting with "FINAL ANSWER: "
Example format: FINAL ANSWER: 10 or FINAL ANSWER: 36*pi
Now, please solve this problem:

D. SGRE Supplementary Information

D.1. Theorem Search Tree and Search Process

Figure 24. Theorem Search Tree and an inference demonstration is shown in fig. 28. The search process involves constructing a sea...

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Supplementary Material

A. Solid Geometry Formal Language

A.1. Formal Geometry Representation

In the domain of solid geometry, simple geometric bodies serve as fundame...

[41]

Let $R_A$ and $R_B$ be face sets of two valid polyhedra

[42]

Execute the operation $R_{\text{result}} = R_A \oplus_{3D} R_B$

[43]

This operation removes the internal contact faces ($S_A$ and $S_B$) and retains all external surfaces

[44]

According to the generalization of Euler's formula for manifolds, when two closed manifolds are glued along a simply connected face and that face is removed, the remaining surface still constitutes a closed 2-manifold (i.e., the boundary of the new polyhedron)
