arxiv: 2606.03144 · v1 · pith:RRE3ZIGXnew · submitted 2026-06-02 · 💻 cs.AI

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Pith reviewed 2026-06-28 10:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords graph theory benchmark · LLM mathematical reasoning · curriculum evaluation · proof construction · model performance hierarchy · zero-shot prompting · hybrid human evaluation

The pith

GTBench shows GPT-5 alone sustains high accuracy on graduate graph theory proofs, while the other models degrade sharply and Llama falls to zero under human evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GTBench, a collection of 63 problems drawn from standard graph theory materials and split into three groups of rising difficulty from basic definitions through algorithm tracing to graduate proof construction. It evaluates five frontier LLMs under zero-shot and chain-of-thought conditions using exact-match scoring, LLM judges, and a hybrid human-expert protocol on the hardest items. Results establish a sharp performance hierarchy in which only one model maintains capability as difficulty increases, while others degrade, accompanied by distinct error patterns at each level. A reader would care because the benchmark directly tests whether current models can serve as reliable assistants for mathematical work that matches an actual curriculum.

Core claim

The paper claims that a pronounced performance hierarchy appears among LLMs when tested on graph theory problems of increasing difficulty. GPT-5 reaches 95.8 percent zero-shot on undergraduate material (Group 1) and 82 percent on graduate proofs (Group 3), whereas the remaining models degrade substantially, with Llama scoring 0 percent on zero-shot Group 3 under human evaluation. Failure modes shift from correct-algorithm, wrong-execution errors in Groups 1 and 2 to additional incomplete-reasoning failures at the proof level, and human evaluators disagree systematically with the automated judge (kappa = 0.48-0.83 across human pairs).

What carries the argument

GTBench, a curriculum-grounded benchmark of 63 problems organized into three groups of increasing difficulty sourced from verified texts and scored by a hybrid of exact match, LLM-as-judge, and human-expert protocols.
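
To make the exact-match leg of that scoring concrete, here is a minimal sketch of per-group accuracy. The paper's actual normalization rules are not described here, so the `normalize` helper, the record fields, and the sample records are all hypothetical.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def exact_match_accuracy(records):
    """Return {group: fraction of records whose prediction equals the reference}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["group"]] += 1
        if normalize(rec["prediction"]) == normalize(rec["reference"]):
            hits[rec["group"]] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative records only; these are not data from the paper.
records = [
    {"group": 1, "prediction": "Bipartite", "reference": "bipartite"},
    {"group": 1, "prediction": "5", "reference": "4"},
    {"group": 2, "prediction": "a  b c", "reference": "a b c"},
]

accuracy = exact_match_accuracy(records)
```

Exact match only suits answers with a single canonical form, which is presumably why the benchmark pairs it with an LLM judge for Groups 1 and 2 and adds human experts for Group 3.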

If this is right

  • Only the strongest model retains meaningful accuracy once tasks reach graduate-level proof construction.
  • Correct algorithm selection followed by faulty execution is the dominant error on definition and tracing problems.
  • Incomplete reasoning and proof gaps become visible only at the graduate level.
  • Human evaluators and the automated judge disagree systematically on near-complete or verbose proofs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curriculum-structured benchmarks could be constructed for other mathematical domains to map similar capability cliffs.
  • The observed drop-off may affect which models are chosen for self-study tools in advanced mathematics courses.
  • Improving agreement between human and automated judges on proof quality would strengthen future evaluations of this kind.
  • Adding more problems from additional textbooks would test whether the reported hierarchy generalizes beyond the current selection.

Load-bearing premise

The 63 chosen problems adequately represent the full graph theory curriculum and the hybrid human-LLM judging protocol measures reasoning quality without systematic bias.

What would settle it

A new run of the same 63 problems in which every model maintains comparable accuracy across all three groups or in which Llama scores above zero on Group 3 under human review.

Figures

Figures reproduced from arXiv: 2606.03144 by Deepti Gupta, Ibrahem ALJabea, Noujoud Nader, Patrick Diehl.

Figure 1: Prompt templates used across all models and groups.
Figure 2: Evaluator prompt used by the GPT-4o judge. The three placeholders [...]
Figure 3: Evaluation results across five LLMs on GTBench Group 1. Left: prompting condition comparison. Right: failure mode breakdown. Data shown is from zero-shot condition.
Figure 4: Evaluation results across five LLMs on GTBench Group 2. Left: prompting condition comparison. Right: failure mode breakdown. Data shown is from zero-shot condition.
Figure 5: Comparison of Zero-Shot (ZS) and Chain-of-Thought (CoT) accuracy on Group 3 (proof-based problems) across five LLMs, [...]
Original abstract

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that correct algorithm, wrong execution errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
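
The abstract reports kappa across human pairs without saying which variant was computed. As a reading aid, the sketch below assumes Cohen's kappa on binary pass/fail verdicts and labels values with the descriptive bands of Landis and Koch (1977). The function names and the sample ratings are illustrative, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Descriptive agreement band from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Illustrative pass/fail verdicts (1 = proof accepted); not data from the paper.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
bands = (landis_koch_band(0.48), landis_koch_band(0.83))
```

On that scale the lower end of the reported range, 0.48, falls in the moderate band and the upper end, 0.83, in the almost perfect band, so agreement varies considerably from one human pair to another.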

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GTBench, a curriculum-grounded benchmark comprising 63 graph theory problems sourced from verified materials such as Diestel's Graph Theory. Problems are organized into three groups of increasing difficulty: Group 1 (undergraduate definitions and basic properties), Group 2 (algorithm tracing and structural reasoning), and Group 3 (graduate-level proof construction). Five frontier LLMs are evaluated under zero-shot and chain-of-thought prompting using exact-match and LLM-as-judge metrics for Groups 1-2 and a hybrid human-LLM judge protocol for Group 3. Results claim a pronounced performance hierarchy, with GPT-5 achieving 95.8% zero-shot on Group 1 and 82% on Group 3, while other models (notably Llama 3.3 70B at 0% under human evaluation on Group 3 zero-shot) degrade sharply with difficulty; failure mode analysis and inter-evaluator agreement (kappa 0.48-0.83) are also reported.

Significance. If the central performance hierarchy holds under a robust evaluation protocol, GTBench would provide the first curriculum-grounded framework for assessing LLMs on graph-theoretic reasoning, with useful implications for AI governance in mathematical education and research. The sourcing from established textbooks and the multi-level difficulty structure are positive features that distinguish it from purely synthetic benchmarks.

major comments (2)
  1. [Group 3 evaluation protocol and results] The hybrid human expert and LLM-as-judge protocol for Group 3 (described in the abstract and results) reports only moderate agreement with kappa values of 0.48-0.83 across human pairs, with noted disagreements especially on verbose or near-complete proofs. This directly undermines the load-bearing claim of a performance hierarchy, as the specific figures (GPT-5 at 82%, Llama at 0% zero-shot under human evaluation) become sensitive to evaluator choice and cannot be taken as stable without further validation or resolution of the disagreement.
  2. [Benchmark construction and problem selection] The claim that the 63 problems are representative of the full graph theory curriculum (abstract and problem selection description) is asserted via sourcing from Diestel and similar materials but is not supported by any coverage analysis, topic distribution table, or external validation. This assumption is load-bearing for generalizing the degradation pattern beyond the specific selected problems.
minor comments (1)
  1. [Model selection] The abstract and results refer to 'GPT-5' without clarifying whether this denotes a publicly available model or an internal/preview version; this should be explicitly stated in the methods or model description section for reproducibility.
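
Major comment 1 turns on how sensitive the Group 3 figures are to evaluator choice. One way to probe that, sketched here under stated assumptions, is to re-score each proof under several rules for resolving rater disagreement and see whether the model ranking survives. The rule names, model labels, and votes below are hypothetical and do not reproduce the paper's protocol or data.

```python
def resolve(votes, rule):
    """Collapse one proof's pass/fail votes (1 = pass) into a single verdict."""
    passes = sum(votes)
    if rule == "strict":      # every rater must accept the proof
        return passes == len(votes)
    if rule == "majority":    # more than half of the raters accept
        return passes * 2 > len(votes)
    if rule == "lenient":     # at least one rater accepts
        return passes > 0
    raise ValueError(f"unknown rule: {rule}")


def accuracy_by_rule(votes_per_model, rules=("strict", "majority", "lenient")):
    """Return {model: {rule: accuracy}} over each model's graded proofs."""
    return {
        model: {
            rule: sum(resolve(votes, rule) for votes in items) / len(items)
            for rule in rules
        }
        for model, items in votes_per_model.items()
    }


def ranking(table, rule):
    """Models ordered from highest to lowest accuracy under one rule."""
    return sorted(table, key=lambda model: table[model][rule], reverse=True)


# Illustrative votes from three raters on four proofs per model.
votes_per_model = {
    "model_a": [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 0)],
    "model_b": [(1, 1, 1), (1, 1, 1), (0, 0, 0), (0, 0, 0)],
}

table = accuracy_by_rule(votes_per_model)
```

In this toy data the ranking flips between the strict and lenient rules, which is the kind of instability the referee is asking the authors to rule out.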

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on GTBench. We address each major comment below with clarifications and indicate where revisions are planned.

  1. Referee: [Group 3 evaluation protocol and results] The hybrid human expert and LLM-as-judge protocol for Group 3 (described in the abstract and results) reports only moderate agreement with kappa values of 0.48-0.83 across human pairs, with noted disagreements especially on verbose or near-complete proofs. This directly undermines the load-bearing claim of a performance hierarchy, as the specific figures (GPT-5 at 82%, Llama at 0% zero-shot under human evaluation) become sensitive to evaluator choice and cannot be taken as stable without further validation or resolution of the disagreement.

    Authors: We acknowledge that the reported kappa range of 0.48-0.83 reflects only moderate to substantial agreement and that disagreements arise particularly on verbose or near-complete proofs. The manuscript already flags these issues and presents the performance hierarchy using human expert evaluations as the primary metric for Group 3. To strengthen the presentation, we will add a dedicated subsection with concrete disagreement examples, report separate accuracy figures under both human and LLM-as-judge protocols, and include a sensitivity analysis showing how the hierarchy changes under alternative resolutions of disputed cases. These additions will make the dependence on evaluator choice explicit without altering the core reported numbers. revision: partial

  2. Referee: [Benchmark construction and problem selection] The claim that the 63 problems are representative of the full graph theory curriculum (abstract and problem selection description) is asserted via sourcing from Diestel and similar materials but is not supported by any coverage analysis, topic distribution table, or external validation. This assumption is load-bearing for generalizing the degradation pattern beyond the specific selected problems.

    Authors: We agree that an explicit coverage analysis is needed to support claims of representativeness. In the revised manuscript we will add a topic-distribution table that maps each of the 63 problems to the corresponding chapters and sections of Diestel’s Graph Theory (and, where relevant, other standard references), together with a short paragraph quantifying the balance across core undergraduate topics (e.g., connectivity, coloring, matching) and graduate-level topics (e.g., extremal graph theory, topological graph theory). This table will directly address the generalizability concern. revision: yes
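
The topic-distribution table promised in the second response could be produced with a simple tally of problems by topic and difficulty group. The sketch below assumes each problem carries a `group` and a `topic` field. Those field names, the problem IDs, and the sample assignments are hypothetical; only the topic names come from the response above.

```python
from collections import Counter


def topic_distribution(problems):
    """Return {topic: {group: count}} with zeros filled in for empty cells."""
    counts = Counter((p["group"], p["topic"]) for p in problems)
    groups = sorted({group for group, _ in counts})
    topics = sorted({topic for _, topic in counts})
    return {
        topic: {group: counts[(group, topic)] for group in groups}
        for topic in topics
    }


def format_table(table):
    """Render the tally as plain-text rows, one topic per line."""
    groups = sorted(next(iter(table.values())))
    header = "topic".ljust(16) + "".join(f"G{g}".rjust(5) for g in groups)
    rows = [
        topic.ljust(16) + "".join(str(cells[g]).rjust(5) for g in groups)
        for topic, cells in table.items()
    ]
    return "\n".join([header, *rows])


# Illustrative assignments only; not the benchmark's actual problems.
problems = [
    {"id": "P01", "group": 1, "topic": "connectivity"},
    {"id": "P02", "group": 1, "topic": "coloring"},
    {"id": "P03", "group": 1, "topic": "coloring"},
    {"id": "P04", "group": 3, "topic": "matching"},
]

table = topic_distribution(problems)
print(format_table(table))
```

Empty cells are kept as explicit zeros so that gaps in coverage, such as a topic that never appears at the proof level, are visible rather than silently omitted.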

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation with no derivations or self-referential reductions

The paper introduces GTBench as a new benchmark of 63 problems sourced from external verified materials (e.g., Diestel's Graph Theory) and reports empirical performance of frontier LLMs under zero-shot and CoT prompting. Evaluation relies on exact-match, LLM-as-judge, and hybrid human-LLM protocols with reported inter-rater statistics (kappa values), but contains no equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations. The central claims are direct measurements of model accuracy on held-out problems; no step reduces by construction to the paper's own definitions or prior outputs. This is a standard empirical evaluation paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the work is an empirical benchmark study. Information is limited to the abstract only.
