Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
Pith reviewed 2026-05-17 18:24 UTC · model grok-4.3
The pith
A large model trained with reinforcement learning and a formal reasoning pattern reaches 80.7 percent on the miniF2F Lean theorem proving benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimina-Prover sets a new state-of-the-art on the miniF2F benchmark, reaching 80.7% with pass@8192, by employing a structured reasoning pattern termed formal reasoning pattern that allows the model to emulate human problem-solving strategies in Lean, iteratively generating and refining proof steps, trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B.
What carries the argument
The formal reasoning pattern: a structured method for emulating human problem-solving in Lean by iteratively generating and refining proof steps.
Load-bearing premise
That the reported performance gains stem primarily from the novel formal reasoning pattern and reinforcement learning rather than from large sampling budgets or benchmark-specific tuning.
What would settle it
An ablation that trains or runs the model without the formal reasoning pattern and measures whether the miniF2F pass rate drops well below 80.7 percent at the same sampling budget.
read the original abstract
We introduce Kimina-Prover Preview, a large language model that pioneers a novel reasoning-driven exploration paradigm for formal theorem proving, as showcased in this preview release. Trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, Kimina-Prover demonstrates strong performance in Lean 4 proof generation by employing a structured reasoning pattern we term \textit{formal reasoning pattern}. This approach allows the model to emulate human problem-solving strategies in Lean, iteratively generating and refining proof steps. Kimina-Prover sets a new state-of-the-art on the miniF2F benchmark, reaching 80.7% with pass@8192. Beyond improved benchmark performance, our work yields several key insights: (1) Kimina-Prover exhibits high sample efficiency, delivering strong results even with minimal sampling (pass@1) and scaling effectively with computational budget, stemming from its unique reasoning pattern and RL training; (2) we demonstrate clear performance scaling with model size, a trend previously unobserved for neural theorem provers in formal mathematics; (3) the learned reasoning style, distinct from traditional search algorithms, shows potential to bridge the gap between formal verification and informal mathematical intuition. We open source distilled versions with 1.5B and 7B parameters of Kimina-Prover
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kimina-Prover Preview, a 72B-parameter model initialized from Qwen2.5 and further trained via large-scale reinforcement learning. It employs a structured 'formal reasoning pattern' that emulates iterative human-style proof construction in Lean 4. The central empirical claim is a new state-of-the-art on the miniF2F benchmark of 80.7% pass@8192, together with assertions of high sample efficiency (strong pass@1 results), clear scaling with model size, and the release of distilled 1.5B and 7B open-source checkpoints.
Significance. If the performance gains can be shown to arise primarily from the combination of the formal reasoning pattern and RL rather than from sampling scale or data overlap, the result would be significant for neural theorem proving. It would provide evidence that large-scale RL can induce human-like iterative reasoning styles that scale with model size and compute, potentially narrowing the gap between informal mathematical intuition and formal verification. The open-sourcing of smaller distilled models is a concrete positive contribution for reproducibility and further research.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the headline 80.7% pass@8192 result on miniF2F is presented without a matched-budget baseline that applies the identical pass@8192 sampling regime to the strongest prior Lean 4 prover. Without this comparison, it is impossible to determine how much of the reported improvement is attributable to the formal reasoning pattern and RL training versus the large sampling budget itself.
- [§3 and §4.1] §3 (Method) and §4.1: the description of the RL reward design, training data composition, and any decontamination steps against miniF2F is insufficiently detailed. Because the central claim attributes gains to the novel reasoning pattern plus RL, the absence of an explicit training-set overlap audit or contamination analysis leaves the causal attribution under-supported.
- [§4.3] §4.3 (Ablations): while the abstract asserts 'high sample efficiency' and scaling with computational budget, the manuscript does not report pass@k curves for the strongest non-RL baselines at the same range of k values used for Kimina-Prover. This omission weakens the claim that the observed efficiency stems from the reasoning pattern rather than from the particular sampling procedure.
minor comments (2)
- [Introduction] The precise definition and pseudocode for the 'formal reasoning pattern' appear only after the abstract; moving a concise description or illustrative example to the introduction would improve readability.
- [Figures and Tables] Table captions and axis labels in the scaling plots should explicitly state the exact pass@k values and model sizes compared, to avoid ambiguity when readers interpret the scaling trend.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional comparisons and details would strengthen the manuscript. We address each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline 80.7% pass@8192 result on miniF2F is presented without a matched-budget baseline that applies the identical pass@8192 sampling regime to the strongest prior Lean 4 prover. Without this comparison, it is impossible to determine how much of the reported improvement is attributable to the formal reasoning pattern and RL training versus the large sampling budget itself.
Authors: We agree that a matched-budget comparison would help isolate the contributions of the formal reasoning pattern and RL training. In the revised manuscript we will add a direct comparison of the strongest prior Lean 4 prover under the identical pass@8192 regime. We also note that the model already demonstrates competitive pass@1 performance, which is less sensitive to sampling scale, but the additional baseline will clarify the overall picture. revision: yes
-
Referee: [§3 and §4.1] §3 (Method) and §4.1: the description of the RL reward design, training data composition, and any decontamination steps against miniF2F is insufficiently detailed. Because the central claim attributes gains to the novel reasoning pattern plus RL, the absence of an explicit training-set overlap audit or contamination analysis leaves the causal attribution under-supported.
Authors: We accept that the current description is insufficient for full causal attribution. The revised version will expand §3 and §4.1 with a complete specification of the RL reward function, the sources and composition of the training data, and the decontamination procedures applied to avoid overlap with miniF2F, including an explicit contamination audit. revision: yes
-
Referee: [§4.3] §4.3 (Ablations): while the abstract asserts 'high sample efficiency' and scaling with computational budget, the manuscript does not report pass@k curves for the strongest non-RL baselines at the same range of k values used for Kimina-Prover. This omission weakens the claim that the observed efficiency stems from the reasoning pattern rather than from the particular sampling procedure.
Authors: We agree that reporting pass@k curves for the strongest non-RL baselines across the same range of k would better support the sample-efficiency claim. In the revised §4.3 we will include these curves, allowing a direct comparison that isolates the effect of the formal reasoning pattern from sampling procedure differences. revision: yes
Circularity Check
No significant circularity: empirical results on external benchmark
full rationale
The paper reports an empirical outcome from RL training of a language model on formal theorem proving tasks, with headline performance measured directly against the independent miniF2F benchmark (80.7% at pass@8192). No derivation chain, equations, or first-principles predictions are presented that reduce to self-defined inputs, fitted parameters renamed as predictions, or self-citation load-bearing premises. The described 'formal reasoning pattern' is an observational training outcome rather than a tautological construct, and scaling/sample-efficiency claims are experimental observations, not forced by internal definitions. The work is self-contained against external benchmarks with no quoted reductions to its own inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
-
formal reasoning pattern
no independent evidence
Forward citations
Cited by 16 Pith papers
-
MathAtlas: A Benchmark for Autoformalization in the Wild
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
-
Automatic Textbook Formalization
Multi-agent AI system formalizes entire 500-page graduate algebraic combinatorics textbook into Lean, creating 130K lines of code in one week at human-expert cost.
-
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
-
Lean Atlas: An Integrated Proof Environment for Scalable Human-AI Collaborative Formalization
Lean Atlas visualizes Lean 4 dependency graphs and applies Lean Compass to reduce the nodes needing human semantic review by 27-99% across six evaluated projects.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
Teaching LLMs Program Semantics via Symbolic Execution Traces
Training Qwen3-8B on symbolic execution traces from Soteria improves violation detection in C programs by over 17 points, transfers across five property types, and shows superadditive gains with chain-of-thought.
-
Surface Sensitivity in Lean 4 Autoformalization
Paraphrase sensitivity in Lean 4 autoformalization arises from compilation failures rather than semantic divergence among successful formalizations.
-
Scaling Self-Play with Self-Guidance
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
-
SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Disjoint SFT and GRPO data for autoformalization yields up to 10.4pp semantic accuracy gains over full overlap, which renders the GRPO stage redundant.
-
Teaching an Agent to Sketch One Part at a Time
A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.
-
A Minimal Agent for Automated Theorem Proving
A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.
-
Aristotle: IMO-level Automated Theorem Proving
Aristotle reaches gold-medal-equivalent performance on 2025 IMO problems via integrated Lean proof search, informal lemma formalization, and a dedicated geometry solver.
-
OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving
OptProver transfers formal theorem proving from Olympiad math to optimization via continual training, achieving SOTA Pass@1 and Pass@32 on a new Lean 4 benchmark while retaining general performance.
-
Rethinking Wireless Communications through Formal Mathematical AI Reasoning
Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
-
Understanding Tool-Augmented Agents for Lean Formalization: A Factorial Analysis
Factorial experiments show that tool-augmented agents achieve substantially higher compilation success and semantic equivalence than one-shot baselines when translating natural language mathematics into Lean 4.
-
AI for Mathematics: Progress, Challenges, and Prospects
AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.
Reference graph
Works this paper leans on
-
[1]
Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function
arXiv: 2407.11214 [cs.AI]. URL: https://arxiv.org/abs/2407.11214. Wang, Haiming et al. “Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023, pp. 12632–12646. Wang, Haiming, Huajian Xin...
-
[2]
URL: https://arxiv.org/abs/2310.00656
arXiv: 2310.00656 [cs.AI]. URL: https://arxiv.org/abs/2310.00656. Wang, Ruida et al. MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving. 2025. arXiv: 2503.03205 [cs.CL]. URL: https://arxiv.org/abs/2503.03205. Wu, Zijian et al. InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on La...
- [3]
-
[4]
Filtering and Preprocessing: We filter this dataset based on specific criteria, retaining only problems clearly classified as either proofs or problems with explicit numeric or symbolic outputs. Problems involving geometry and combinatorics are excluded to form a dataset more suitable for autoformalization. This filtered dataset is denoted by auto-stateme...
-
[5]
Autoformalization: The auto-statement-candidates dataset undergoes an automatic formalization process, resulting in the auto-statements dataset. This step converts natural language mathematical problems into formal statements compatible with the Lean 4 proof assistant
-
[6]
Human Annotation and Refinement: To ensure the quality and precision of formalizations, we established a dedicated annotation team tasked with reviewing and refining the outputs from the autoformalization process. Annotated outputs are categorized as follows: • Human Statements: Statements refined by human annotators. • Human Proofs: Select, challenging s...
-
[7]
Prompt Set Creation: The refined formalized statements from the auto-statements and human statements datasets are combined to form our final prompt set. The validated and proven formal statements subsequently enrich the SFT dataset, creating a dynamic cycle of continuous improvement for subsequent iterations of reinforcement learning. This structured and ...
work page 2024
-
[8]
From the two equations, we can eliminate a to solve for d
a + 10d = 60 (h1) We need to find a + 20d. From the two equations, we can eliminate a to solve for d. Subtracting h0 from h1: (a + 10d) − (a + 6d) = 60 − 30 4d = 30 d = 30/4 = 7 .5 or 15 2 Let’s compute this in Lean: have hd : d = 15 / 2 := by linarith Now we have d = 15 2 . Let’s verify this calculation: From a + 10d = 60 and a + 6d = 30, subtracting giv...
work page 1962
-
[9]
Introduce x, y, z as above
-
[10]
Prove x, y, z > 0 using the triangle inequalities
-
[11]
Substitute a = x + y, b = y + z, c = z + x into the goal
-
[12]
Expand the expression
-
[13]
Show it’s equal to the sum of squares
-
[14]
Conclude non-negativity The Lean code to perform this substitution would look like: have ha : a = x + y := by dsimp [x, y, z] linarith ...... The entire Lean proof leveraging the Ravi substitution and expansion would look something like: theorem imo_1983_p6 (a b c : R) (h 0 : 0 < a ∧ 0 < b ∧ 0 < c) (h 1 : c < a + b) (h 2 : b < a + c) (h3 : a < b + c) : 0 ...
-
[15]
n ≡ 0 mod6: n=6k, use a=3, b=3, c=3, d=2k-3
-
[16]
n ≡ 2 mod6: n=6k+2, use a=3, b=2k-11, c=5, d=7
-
[17]
observation of small cases → conjecture formulation → rigorous proof
n ≡ 4 mod6: n=6k+4, use a=3, b=2k-7, c=5, d=5 ...... </think> theorem number_theory_72826 {n : N} (hn : Even n) (h : n > 38) : ∃ a b c d : N, Odd a ∧ Odd b ∧ Odd c ∧ Odd d ∧ a > 1 ∧ b > 1 ∧ c > 1 ∧ d > 1 ∧ n = a * b + c * d := by have h1 : n % 2 = 0 := by exact Nat.even_iff.1 hn have h2 : n > 38 := h have h3 : n % 6 = 0 ∨ n % 6 = 2 ∨ n % 6 = 4 := by have ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.