A new benchmark of 9,415 Lean 4 specifications derived from 2,772 scraped Python property-based tests, plus a three-agent LLM transpilation pipeline and proof-generation baselines.
miniCTX: Neural Theorem Proving with (Long-)Contexts
5 Pith papers cite this work.
representative citing papers
s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
ImProver is an LLM agent using Chain-of-States, error-correction, and retrieval to rewrite Lean proofs for arbitrary user-defined optimization criteria like shortness and readability.
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
citing papers explorer
- ImProver: Agent-Based Automated Proof Optimization
  ImProver is an LLM agent using Chain-of-States, error-correction, and retrieval to rewrite Lean proofs for arbitrary user-defined optimization criteria like shortness and readability.
- How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
  Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.