Automated Proving of Shannon-Type Entropy Inequalities via Fine-Tuned Language Models and Guided Tree Search

Amin Gohari; Cheuk Ting Li; Linqi Song; Shaocheng Liu; Shing Yin Wong

arxiv: 2606.05729 · v1 · pith:XAKQRNLVnew · submitted 2026-06-04 · 💻 cs.IT · cs.LG· math.IT

Automated Proving of Shannon-Type Entropy Inequalities via Fine-Tuned Language Models and Guided Tree Search

Shing Yin Wong , Shaocheng Liu , Linqi Song , Amin Gohari , Cheuk Ting Li This is my paper

Pith reviewed 2026-06-27 23:50 UTC · model grok-4.3

classification 💻 cs.IT cs.LGmath.IT

keywords entropy inequalitiesautomated theorem provinglanguage modelsinformation theorybeam searchShannon inequalitiesproof automationtree search

0 comments

The pith

A 0.6B fine-tuned language model with guided beam search proves 85 percent of held-out Shannon entropy inequalities for 10 to 15 variables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether small language models can learn to generate the sequences of linear combinations needed to prove Shannon-type entropy inequalities, a task that grows combinatorially hard with more random variables. It fine-tunes models on examples of atomic proof steps and then applies guided beam search to explore full proofs. The 0.6B model reaches 85 percent success on a test set of 60 inequalities, beating both zero-shot large models and the existing Psitip solver. Ablation experiments show that balanced training data and moderate context length work better than skewed data or longer contexts. The work identifies format errors and step-quality drops as the main failure modes and confirms that the model's scoring heuristic drives most of the performance gain.

Core claim

Fine-tuned language models on atomic proof steps, when paired with guided beam search, can construct valid linear-combination proofs for Shannon-type entropy inequalities at scale; a 0.6B model succeeds on 85 percent of a held-out test set spanning 10 to 15 variables, while zero-shot GPT-5.5 reaches only 1.7 percent and Psitip reaches 33.3 percent.

What carries the argument

Guided beam search over sequences of atomic entropy-inequality proof steps produced by a fine-tuned language model, with the model itself providing the scoring heuristic that ranks partial proofs.

If this is right

Proof search for entropy inequalities becomes feasible for variable counts where exhaustive enumeration is impossible.
Training on atomic steps from smaller cases transfers to larger cases without retraining.
Balanced rather than skewed data distributions and moderate context lengths yield the highest success rates.
The beam-scoring function is essential; replacing it with random scoring cuts success from roughly 83 percent to 23 percent.
Format failures and progressive degradation of step quality are the dominant error types that future work must address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fine-tuning-plus-guided-search pattern could be applied to other combinatorial search tasks in information theory that rely on linear combinations of known inequalities.
Hybrid systems that feed the model's candidate steps into an existing symbolic checker such as Psitip might raise reliability without increasing model size.
If atomic steps continue to compose for still larger variable counts, the method offers a route to proofs that currently require human insight into non-obvious combinations.
Failure on format or step quality could be mitigated by post-training the model to emit machine-checkable intermediate expressions rather than free text.

Load-bearing premise

The atomic proof steps seen during training already contain every building block required to assemble correct proofs for the held-out inequalities.

What would settle it

Construct a new test set of inequalities whose shortest valid proofs each require at least one atomic step or linear combination absent from the training distribution; a sharp drop below 85 percent success would falsify the claim.

read the original abstract

Proving Shannon-type entropy inequalities is a fundamental task in information theory that often requires constructing non-trivial linear combinations of known constraints, which is a combinatorial search problem that scales poorly with the number of random variables. We investigate whether small-scale large language models (0.6B--1.7B parameters), fine-tuned on atomic proof steps and combined with guided beam search, can automate this process. On a held-out test set of 60 inequalities spanning n=10 to 15 variables, our 0.6B fine-tuned model achieves an 85\% proof success rate with tree search. GPT-5.5 solves 1.7\% samples under zero-shot prompting while Psitip solves 33.3\% samples. A systematic ablation study across training context length (4096 vs.\ 8192 tokens) and data distribution (n=9-skewed vs not skewed) reveals that a 4096-token not skewed training distribution yields the best performance, with extended context and skewed data providing no marginal benefit. We further identify two dominant failure modes -- format failures and step quality degradation -- and verify that the beam-scoring heuristic is essential via a controlled ablation (random scoring reduces success from 83\% to 23\%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fine-tuned 0.6B LLMs plus guided search hit 85% on held-out entropy inequalities, but the success metric lacks external validity checks.

read the letter

The paper's core result is that a 0.6B model fine-tuned on atomic proof steps, paired with guided beam search, reaches 85% success on a held-out set of 60 inequalities with 10-15 variables. This beats their GPT-5.5 zero-shot baseline (1.7%) and Psitip (33.3%).

The ablations are the strongest part. They directly compare 4096 vs 8192 token context and skewed vs balanced training data, show that longer context and skew add nothing, and run a clean control where random scoring drops success from 83% to 23%. Naming the two failure modes (format and step quality) is also useful.

The soft spot is exactly the one the stress test flags. Success is defined by the search completing without format errors or quality drop, using an internal beam heuristic. The abstract gives no indication that outputs are passed through an independent symbolic verifier such as a linear-programming formulation or Psitip re-check. If the model can assemble syntactically clean but mathematically invalid linear combinations, the 85% figure does not directly measure proof correctness. That makes the comparison to the baselines less decisive than it appears.

The work is aimed at researchers who already care about automated proof search inside information theory. It has enough concrete numbers and controls to justify sending it to referees, provided the authors can clarify or add an external validation step for the generated proofs.

Referee Report

1 major / 2 minor

Summary. The paper claims that fine-tuned small LLMs (0.6B–1.7B parameters) trained on atomic proof steps, when combined with guided beam search, can automate construction of Shannon-type entropy inequality proofs. On a held-out test set of 60 inequalities for n=10–15 variables, the 0.6B model reaches 85% success (vs. 1.7% for zero-shot GPT-5.5 and 33.3% for Psitip); ablations show 4096-token non-skewed training is optimal, beam scoring is essential, and the two main failure modes are format errors and step-quality degradation.

Significance. If the reported proofs are mathematically valid, the work would demonstrate a practical ML-assisted approach to a combinatorially hard problem in information theory, with concrete empirical gains over existing tools and useful ablation data on context length and data distribution. The explicit reporting of success rates, baseline comparisons, and controlled ablations (including the random-scoring control) are strengths that would make the result reproducible and extensible if the validity concern is resolved.

major comments (1)

[Evaluation on held-out test set] The central empirical claim (85% success on the 60-inequality held-out set) is defined in the evaluation section solely via syntactic completion of the guided tree search (no format failures, no step-quality degradation) and a beam-scoring heuristic. No external symbolic verifier (e.g., LP formulation of Shannon inequalities or Psitip re-checking of the final linear combination) is described or applied to confirm that each output constitutes a mathematically valid proof. Because the model is trained only on atomic steps, this leaves open the possibility that some fraction of the 85% are syntactically well-formed but semantically invalid derivations, directly affecting the validity of the comparisons to Psitip and GPT-5.5.

minor comments (2)

[Data and test-set construction] The description of how the 60-inequality held-out test set was constructed (sampling procedure, variable-range constraints, and separation from training data) is not detailed enough to allow independent replication.
[Failure-mode analysis] The paper does not report the exact definition or implementation of the "step-quality degradation" failure mode used to count successes.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The concern about external validation of proof validity is well-taken, and we address it directly below by agreeing to strengthen the evaluation.

read point-by-point responses

Referee: [Evaluation on held-out test set] The central empirical claim (85% success on the 60-inequality held-out set) is defined in the evaluation section solely via syntactic completion of the guided tree search (no format failures, no step-quality degradation) and a beam-scoring heuristic. No external symbolic verifier (e.g., LP formulation of Shannon inequalities or Psitip re-checking of the final linear combination) is described or applied to confirm that each output constitutes a mathematically valid proof. Because the model is trained only on atomic steps, this leaves open the possibility that some fraction of the 85% are syntactically well-formed but semantically invalid derivations, directly affecting the validity of the comparisons to Psitip and GPT-5.5.

Authors: We agree that an independent symbolic verifier would strengthen the empirical claims. The training corpus consists exclusively of verified atomic steps drawn from known valid proofs, and the search procedure restricts actions to these steps while treating step-quality degradation as an explicit failure mode. Nevertheless, this does not replace an external check. In the revised manuscript we will integrate a post-hoc symbolic verifier based on the linear-programming formulation of Shannon inequalities (or equivalently re-checking the final linear combination via Psitip) and report both the original syntactic success rate and the verified success rate on the 60-inequality test set. Any proofs that fail verification will be counted as failures, and the comparisons to baselines will be updated accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical success rate measured on held-out test set with external baselines

full rationale

The paper reports an 85% proof success rate for a 0.6B fine-tuned model on a held-out test set of 60 inequalities (n=10-15), compared against Psitip (33.3%) and GPT-5.5 (1.7%). Success is defined via completion of guided tree search without format failures or step-quality degradation, with an ablation confirming beam-scoring importance (83% vs 23%). No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear; the held-out evaluation and external comparisons keep the central claim independent of training inputs by construction. This is a standard empirical ML result with no reduction of the reported metric to its own definitions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The performance claim rests on the assumption that training on atomic steps generalizes to new inequalities and that the beam-scoring heuristic reliably identifies valid proofs; no new entities are postulated.

free parameters (2)

context length = 4096
Ablation identified 4096 tokens as optimal
data distribution = not skewed
Non-skewed distribution selected after ablation

axioms (1)

domain assumption Atomic proof steps from the training distribution suffice to compose proofs for unseen inequalities
Required for the generalization from training to the held-out test set of n=10-15 inequalities

pith-pipeline@v0.9.1-grok · 5772 in / 1391 out tokens · 34234 ms · 2026-06-27T23:50:22.263997+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 4 internal anchors

[1]

On characterization of entropy function via information inequalities,

Z. Zhang and R. W. Yeung, “On characterization of entropy function via information inequalities,”IEEE Trans. Inf. Theory, vol. 44, no. 4, pp. 1440–1452, Jul. 1998

1998
[2]

Secret-sharing schemes: A survey,

A. Beimel, “Secret-sharing schemes: A survey,” inCoding and Cryptol- ogy, Lecture Notes in Computer Science, vol. 6639. Berlin, Heidelberg: Springer, 2011, pp. 11–46

2011
[3]

Characterizing the rate region of the (4,3,3) exact-repair regenerating codes,

C. Tian, “Characterizing the rate region of the (4,3,3) exact-repair regenerating codes,”IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 967– 975, May 2014

2014
[4]

A framework for linear information inequalities,

R. W. Yeung, “A framework for linear information inequalities,”IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1924–1934, Nov. 1997

1924
[5]

Pulikkoonattu and S

R. Pulikkoonattu and S. Diggavi,ITIP-Based C Program Software Package, 2006. [Online]. Available: http://xitip.epfl.ch

2006
[6]

Csirmaz,Minitip—A Minimal Information Theoretic Inequality Prover, 2016

L. Csirmaz,Minitip—A Minimal Information Theoretic Inequality Prover, 2016. [Online]. Available: https://github.com/lcsirmaz/minitip

2016
[7]

Proving and disproving information inequalities: Theory and scalable algorithms,

S.-W. Ho, L. Ling, C. W. Tan, and R. W. Yeung, “Proving and disproving information inequalities: Theory and scalable algorithms,”IEEE Trans. Inf. Theory, vol. 66, no. 9, pp. 5522–5536, Sep. 2020

2020
[8]

An automated theorem proving framework for information- theoretic results,

C. T. Li, “An automated theorem proving framework for information- theoretic results,”IEEE Trans. Inf. Theory, vol. 69, no. 11, pp. 6857– 6877, Nov. 2023

2023
[9]

Proving information inequalities by Gaussian elimination,

L. Guo, R. W. Yeung, and X.-S. Gao, “Proving information inequalities by Gaussian elimination,”IEEE Trans. Inf. Theory, vol. 71, no. 4, pp. 2315–2328, Apr. 2025

2025
[10]

Qwen3 Technical Report

Qwen Team, “Qwen3 technical report,” arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inProc. Int. Conf. Learn. Represent. (ICLR), 2022

2022
[13]

Generative Language Modeling for Automated Theorem Proving

S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,” arXiv:2009.03393, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[14]

Formal mathematics statement curriculum learning,

S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever, “Formal mathematics statement curriculum learning,” inProc. Int. Conf. Learn. Represent. (ICLR), 2023

2023
[15]

Reinforcement learning of theorem proving,

C. Kaliszyk, J. Urban, H. Michalewski, and M. Olsák, “Reinforcement learning of theorem proving,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2018

2018
[16]

Premise selection for theorem proving by deep graph embedding,

M. Wang, Y . Tang, J. Wang, and J. Deng, “Premise selection for theorem proving by deep graph embedding,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017

2017
[17]

HyperTree proof search for neural theorem proving,

G. Lample, T. Lacroix, M.-A. Lachaux, A. Rodriguez, A. Hayat, T. Lavril, G. Ebner, and X. Martinet, “HyperTree proof search for neural theorem proving,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022

2022
[18]

BFS-Prover: Scalable best-first tree search for LLM-based automatic theorem proving,

R. Xin, C. Xi, J. Yang, F. Chen, H. Wu, X. Xiao, Y . Sun, S. Zheng, and M. Ding, “BFS-Prover: Scalable best-first tree search for LLM-based automatic theorem proving,” inProc. 63rd Annu. Meeting Assoc. Comput. Linguistics (ACL), Vol. 1: Long Papers, 2025, pp. 32588–32599

2025
[19]

Thor: Wielding hammers to integrate language models and automated theorem provers,

A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski, T. Odrzygó´ zd´ z, P. Miło´s, Y . Wu, and M. Jamnik, “Thor: Wielding hammers to integrate language models and automated theorem provers,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022

2022
[20]

DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,

H. Xin, D. Guo, Z. Shao, Z. Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang, “DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,” inProc. NeurIPS Workshop Mathematical Reasoning and AI (MATH-AI), 2024

2024
[21]

ProofOptimizer: Training language models to simplify proofs without human demonstrations,

A. Gu, B. Piotrowski, F. Gloeckle, K. Yang, and A. H. Markosyan, “ProofOptimizer: Training language models to simplify proofs without human demonstrations,” inProc. Int. Conf. Learn. Represent. (ICLR), 2026

2026
[22]

Solving olympiad geometry without human demonstrations,

T. H. Trinh, Y . Wu, Q. V . Le, H. He, and T. Luong, “Solving olympiad geometry without human demonstrations,”Nature, vol. 625, pp. 476–482, Jan. 2024

2024
[23]

Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,

Y . Chervonyi, T. H. Trinh, M. Olsák, X. Yang, H. H. Nguyen, M. Mene- gali, J. Jung, J. Kim, V . Verma, Q. V . Le, and T. Luong, “Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,”J. Mach. Learn. Res., vol. 26, pp. 1–39, 2025

2025
[24]

Olympiad-level formal mathematical reasoning with reinforcement learning,

T. Hubert et al., “Olympiad-level formal mathematical reasoning with reinforcement learning,”Nature, vol. 651, pp. 607–613, 2026

2026
[25]

Proving olympiad algebraic inequalities without human demonstrations,

C. Wei, M. Sun, and W. Wang, “Proving olympiad algebraic inequalities without human demonstrations,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), Datasets and Benchmarks Track, 2024

2024
[26]

Formal theorem proving by rewarding LLMs to decompose proofs hierarchically,

K. Dong, A. Mahankali, and T. Ma, “Formal theorem proving by rewarding LLMs to decompose proofs hierarchically,” inProc. NeurIPS Workshop Mathematical Reasoning and AI (MATH-AI), 2024

2024
[27]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, O. Järviniemi, M. Barnett, R. Sandler, M. Vrzala, J. Sevilla, Q. Ren, E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, S. V . Enugandla, and M. Wildon, “FrontierMath: A benchmark for evaluating advanced mathem...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?,

Y . Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y . Yue, S. Song, and G. Huang, “Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2025

2025
[29]

Proof automation with large lan- guage models,

M. Lu, B. Delaware, and T. Zhang, “Proof automation with large lan- guage models,” inProc. 39th IEEE/ACM Int. Conf. Automated Software Engineering (ASE), Sacramento, CA, USA, 2024, 12 pp

2024
[30]

Lean Copilot: Large language models as copilots for theorem proving in Lean,

P. Song, K. Yang, and A. Anandkumar, “Lean Copilot: Large language models as copilots for theorem proving in Lean,” inProc. Int. Conf. Neuro-symbolic Systems (NeuS), Proc. Mach. Learn. Res., vol. 288, pp. 144–169, 2025

2025
[31]

Process-driven autoformalization in Lean 4,

J. Lu, Y . Wan, Z. Liu, Y . Huang, J. Xiong, C. Liu, J. Shen, H. Jin, J. Zhang, H. Wang, Z. Yang, J. Tang, and Z. Guo, “Process-driven autoformalization in Lean 4,” arXiv:2406.01940, 2024

work page arXiv 2024
[32]

[Online]

OpenAI,GPT-5.5, 2026. [Online]. Available: https://openai.com/index/ introducing-gpt-5-5/

2026
[33]

[Online]

Anthropic,Claude Sonnet 4.6, 2026. [Online]. Available: https://www. anthropic.com/news/claude-sonnet-4-6

2026
[34]

[Online]

OpenAI,GPT-5.4 mini, 2026. [Online]. Available: https://developers. openai.com/api/docs/models/gpt-5.4-mini

2026
[35]

[Online]

DeepSeek,DeepSeek-V3.2 Release, 2025. [Online]. Available: https:// api-docs.deepseek.com/news/news251201

2025

[1] [1]

On characterization of entropy function via information inequalities,

Z. Zhang and R. W. Yeung, “On characterization of entropy function via information inequalities,”IEEE Trans. Inf. Theory, vol. 44, no. 4, pp. 1440–1452, Jul. 1998

1998

[2] [2]

Secret-sharing schemes: A survey,

A. Beimel, “Secret-sharing schemes: A survey,” inCoding and Cryptol- ogy, Lecture Notes in Computer Science, vol. 6639. Berlin, Heidelberg: Springer, 2011, pp. 11–46

2011

[3] [3]

Characterizing the rate region of the (4,3,3) exact-repair regenerating codes,

C. Tian, “Characterizing the rate region of the (4,3,3) exact-repair regenerating codes,”IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 967– 975, May 2014

2014

[4] [4]

A framework for linear information inequalities,

R. W. Yeung, “A framework for linear information inequalities,”IEEE Trans. Inf. Theory, vol. 43, no. 6, pp. 1924–1934, Nov. 1997

1924

[5] [5]

Pulikkoonattu and S

R. Pulikkoonattu and S. Diggavi,ITIP-Based C Program Software Package, 2006. [Online]. Available: http://xitip.epfl.ch

2006

[6] [6]

Csirmaz,Minitip—A Minimal Information Theoretic Inequality Prover, 2016

L. Csirmaz,Minitip—A Minimal Information Theoretic Inequality Prover, 2016. [Online]. Available: https://github.com/lcsirmaz/minitip

2016

[7] [7]

Proving and disproving information inequalities: Theory and scalable algorithms,

S.-W. Ho, L. Ling, C. W. Tan, and R. W. Yeung, “Proving and disproving information inequalities: Theory and scalable algorithms,”IEEE Trans. Inf. Theory, vol. 66, no. 9, pp. 5522–5536, Sep. 2020

2020

[8] [8]

An automated theorem proving framework for information- theoretic results,

C. T. Li, “An automated theorem proving framework for information- theoretic results,”IEEE Trans. Inf. Theory, vol. 69, no. 11, pp. 6857– 6877, Nov. 2023

2023

[9] [9]

Proving information inequalities by Gaussian elimination,

L. Guo, R. W. Yeung, and X.-S. Gao, “Proving information inequalities by Gaussian elimination,”IEEE Trans. Inf. Theory, vol. 71, no. 4, pp. 2315–2328, Apr. 2025

2025

[10] [10]

Qwen3 Technical Report

Qwen Team, “Qwen3 technical report,” arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inProc. Int. Conf. Learn. Represent. (ICLR), 2022

2022

[13] [13]

Generative Language Modeling for Automated Theorem Proving

S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,” arXiv:2009.03393, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[14] [14]

Formal mathematics statement curriculum learning,

S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever, “Formal mathematics statement curriculum learning,” inProc. Int. Conf. Learn. Represent. (ICLR), 2023

2023

[15] [15]

Reinforcement learning of theorem proving,

C. Kaliszyk, J. Urban, H. Michalewski, and M. Olsák, “Reinforcement learning of theorem proving,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2018

2018

[16] [16]

Premise selection for theorem proving by deep graph embedding,

M. Wang, Y . Tang, J. Wang, and J. Deng, “Premise selection for theorem proving by deep graph embedding,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017

2017

[17] [17]

HyperTree proof search for neural theorem proving,

G. Lample, T. Lacroix, M.-A. Lachaux, A. Rodriguez, A. Hayat, T. Lavril, G. Ebner, and X. Martinet, “HyperTree proof search for neural theorem proving,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022

2022

[18] [18]

BFS-Prover: Scalable best-first tree search for LLM-based automatic theorem proving,

R. Xin, C. Xi, J. Yang, F. Chen, H. Wu, X. Xiao, Y . Sun, S. Zheng, and M. Ding, “BFS-Prover: Scalable best-first tree search for LLM-based automatic theorem proving,” inProc. 63rd Annu. Meeting Assoc. Comput. Linguistics (ACL), Vol. 1: Long Papers, 2025, pp. 32588–32599

2025

[19] [19]

Thor: Wielding hammers to integrate language models and automated theorem provers,

A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski, T. Odrzygó´ zd´ z, P. Miło´s, Y . Wu, and M. Jamnik, “Thor: Wielding hammers to integrate language models and automated theorem provers,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2022

2022

[20] [20]

DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,

H. Xin, D. Guo, Z. Shao, Z. Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang, “DeepSeek-Prover: Advancing theorem proving in LLMs through large-scale synthetic data,” inProc. NeurIPS Workshop Mathematical Reasoning and AI (MATH-AI), 2024

2024

[21] [21]

ProofOptimizer: Training language models to simplify proofs without human demonstrations,

A. Gu, B. Piotrowski, F. Gloeckle, K. Yang, and A. H. Markosyan, “ProofOptimizer: Training language models to simplify proofs without human demonstrations,” inProc. Int. Conf. Learn. Represent. (ICLR), 2026

2026

[22] [22]

Solving olympiad geometry without human demonstrations,

T. H. Trinh, Y . Wu, Q. V . Le, H. He, and T. Luong, “Solving olympiad geometry without human demonstrations,”Nature, vol. 625, pp. 476–482, Jan. 2024

2024

[23] [23]

Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,

Y . Chervonyi, T. H. Trinh, M. Olsák, X. Yang, H. H. Nguyen, M. Mene- gali, J. Jung, J. Kim, V . Verma, Q. V . Le, and T. Luong, “Gold-medalist performance in solving olympiad geometry with AlphaGeometry2,”J. Mach. Learn. Res., vol. 26, pp. 1–39, 2025

2025

[24] [24]

Olympiad-level formal mathematical reasoning with reinforcement learning,

T. Hubert et al., “Olympiad-level formal mathematical reasoning with reinforcement learning,”Nature, vol. 651, pp. 607–613, 2026

2026

[25] [25]

Proving olympiad algebraic inequalities without human demonstrations,

C. Wei, M. Sun, and W. Wang, “Proving olympiad algebraic inequalities without human demonstrations,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), Datasets and Benchmarks Track, 2024

2024

[26] [26]

Formal theorem proving by rewarding LLMs to decompose proofs hierarchically,

K. Dong, A. Mahankali, and T. Ma, “Formal theorem proving by rewarding LLMs to decompose proofs hierarchically,” inProc. NeurIPS Workshop Mathematical Reasoning and AI (MATH-AI), 2024

2024

[27] [27]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, O. Järviniemi, M. Barnett, R. Sandler, M. Vrzala, J. Sevilla, Q. Ren, E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, S. V . Enugandla, and M. Wildon, “FrontierMath: A benchmark for evaluating advanced mathem...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?,

Y . Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y . Yue, S. Song, and G. Huang, “Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2025

2025

[29] [29]

Proof automation with large lan- guage models,

M. Lu, B. Delaware, and T. Zhang, “Proof automation with large lan- guage models,” inProc. 39th IEEE/ACM Int. Conf. Automated Software Engineering (ASE), Sacramento, CA, USA, 2024, 12 pp

2024

[30] [30]

Lean Copilot: Large language models as copilots for theorem proving in Lean,

P. Song, K. Yang, and A. Anandkumar, “Lean Copilot: Large language models as copilots for theorem proving in Lean,” inProc. Int. Conf. Neuro-symbolic Systems (NeuS), Proc. Mach. Learn. Res., vol. 288, pp. 144–169, 2025

2025

[31] [31]

Process-driven autoformalization in Lean 4,

J. Lu, Y . Wan, Z. Liu, Y . Huang, J. Xiong, C. Liu, J. Shen, H. Jin, J. Zhang, H. Wang, Z. Yang, J. Tang, and Z. Guo, “Process-driven autoformalization in Lean 4,” arXiv:2406.01940, 2024

work page arXiv 2024

[32] [32]

[Online]

OpenAI,GPT-5.5, 2026. [Online]. Available: https://openai.com/index/ introducing-gpt-5-5/

2026

[33] [33]

[Online]

Anthropic,Claude Sonnet 4.6, 2026. [Online]. Available: https://www. anthropic.com/news/claude-sonnet-4-6

2026

[34] [34]

[Online]

OpenAI,GPT-5.4 mini, 2026. [Online]. Available: https://developers. openai.com/api/docs/models/gpt-5.4-mini

2026

[35] [35]

[Online]

DeepSeek,DeepSeek-V3.2 Release, 2025. [Online]. Available: https:// api-docs.deepseek.com/news/news251201

2025