FormalRewardBench is the first benchmark for reward models in formal theorem proving. It consists of 250 Lean 4 preference pairs, on which frontier LLMs score 59.8% while specialized provers score only 24.4%.
URL https://aclanthology.org/2025
27 Pith papers cite this work.
citing papers by year
2026: 27
representative citing papers
AXLE is a multi-tenant cloud platform providing Lean 4 metaprogramming utilities with per-request isolation, multi-version support, and public access via SDK and API, having processed over 500 million requests.
Introduces Relaxed NFL intermediate language for LLM-based auto-formalization, with rule-plus-LLM elaboration to Core NFL and tactic-language discharge of verification conditions.
LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.
A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.
An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.
Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
A SAT-plus-LLM method discovers infinite families of doubly saturated Ramsey-good graphs, answering Grinstead and Roberts' 1982 question.
Explorable theorems ground written proofs in Lean formalizations to enable step-by-step execution, custom example testing, and dependency tracing, with a user study showing improved comprehension.
External verification structures, not model capability, determine the reliability of LLM-assisted economic theory, as shown in attempts to design an incentive mechanism for a grade inflation model where adversarial checks caught false claims.
Presents the PyGeoX DSL and a 300-problem benchmark, identifies outlier gradient masking under global rewards, and shows that Saturating Additive Rewards improve the hard-tier solving rate by 2.3x, with an 8B model competitive with larger systems.
Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.
Agent-directed tree search improves LLM performance on Lean formal verification tasks, with context-based orchestration solving more intermediate specs at lower token cost than baseline agents.
ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.
Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
FactorLibrary stores reusable subexpressions to help RL agents (especially PPO+MCTS top-down) find certified optimal arithmetic circuits for polynomials up to complexity 8, with a 91.8% success rate.
Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.
An agentic theorem prover in Lean uses a control plane to route actions based on cost and success estimates, achieving 28.9% lower average cost than a fixed-step baseline on a PutnamBench subset while preserving performance.
Provides a graph model of theorems and proves exponential growth of proved theorems via random-walk conjecturing under connectivity, plus a diversity-maximizing conjecturer using diffusion similarity from contrastive embeddings.
RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark, while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 previously unsolved tasks.
citing papers explorer
- Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation
  Presents the PyGeoX DSL and a 300-problem benchmark, identifies outlier gradient masking under global rewards, and shows that Saturating Additive Rewards improve the hard-tier solving rate by 2.3x, with an 8B model competitive with larger systems.
- An Information-Theoretic Criterion for Efficient Data Synthesis
  Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.
- FactorLibrary: From Polynomials to Circuits via Recursive Subgoals
  FactorLibrary stores reusable subexpressions to help RL agents (especially PPO+MCTS top-down) find certified optimal arithmetic circuits for polynomials up to complexity 8, with a 91.8% success rate.
- From Meta Idea to Advanced Mathematical Discovery -- Human-AI Co-Discovery of Sign-Embedding Quantum Algorithms
  Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.
- A Theoretical Framework for Self-Play Theorem Proving Algorithms
  Provides a graph model of theorems and proves exponential growth of proved theorems via random-walk conjecturing under connectivity, plus a diversity-maximizing conjecturer using diffusion similarity from contrastive embeddings.
- Automating Formal Verification with Reinforcement Learning and Recursive Inference
  RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark, while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 previously unsolved tasks.