FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.
Olympiad-level formal mathematical reasoning with reinforcement learning.Nature
7 Pith papers cite this work. Polarity classification is still indexing.
years
2026 7representative citing papers
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
A SAT-plus-LLM method discovers infinite families of doubly saturated Ramsey-good graphs, answering Grinstead and Roberts' 1982 question.
Explorable theorems ground written proofs in Lean formalizations to enable step-by-step execution, custom example testing, and dependency tracing, with a user study showing improved comprehension.
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
citing papers explorer
-
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.
-
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery
A SAT-plus-LLM method discovers infinite families of doubly saturated Ramsey-good graphs, answering Grinstead and Roberts' 1982 question.
-
Explorable Theorems: Making Written Theorems Explorable by Grounding Them in Formal Representations
Explorable theorems ground written proofs in Lean formalizations to enable step-by-step execution, custom example testing, and dependency tracing, with a user study showing improved comprehension.
-
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
-
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.