hub

URL https://aclanthology.org/2025

2025 · DOI 10.1038/s41586-025-09833-y

27 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

cs.AI · 2026-05-11 · conditional · novelty 8.0

FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.

AXLE: A Cloud Infrastructure for Lean 4 Theorem Proving Utilities

cs.LO · 2026-06-24 · unverdicted · novelty 7.0

AXLE is a multi-tenant cloud platform providing Lean 4 metaprogramming utilities with per-request isolation, multi-version support, and public access via SDK and API, having processed over 500 million requests.

Verifiable Auto-Formalization of Mathematics Using a Relaxed Natural Formal Language

cs.LO · 2026-06-23 · unverdicted · novelty 7.0

Introduces Relaxed NFL, an intermediate language for LLM-based auto-formalization, with rule-plus-LLM elaboration to Core NFL and tactic-language discharge of verification conditions.

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.

Formalizing Mathematics at Scale

cs.AI · 2026-05-28 · accept · novelty 7.0

A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

cs.CR · 2026-05-22 · unverdicted · novelty 7.0

An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.

Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery

math.CO · 2026-04-23 · unverdicted · novelty 7.0

A SAT-plus-LLM method discovers infinite families of doubly saturated Ramsey-good graphs, answering Grinstead and Roberts' 1982 question.

Explorable Theorems: Making Written Theorems Explorable by Grounding Them in Formal Representations

cs.HC · 2026-04-03 · conditional · novelty 7.0

Explorable theorems ground written proofs in Lean formalizations to enable step-by-step execution, custom example testing, and dependency tracing, with a user study showing improved comprehension.

Theorist Toolbox: Tools for Agent Based LLM-assisted economic theory Research

econ.TH · 2026-06-21 · unverdicted · novelty 6.0

External verification structures, not model capability, determine the reliability of LLM-assisted economic theory, as shown in attempts to design an incentive mechanism for a grade inflation model where adversarial checks caught false claims.

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Presents the PyGeoX DSL and a 300-problem benchmark, identifies outlier gradient masking under global rewards, and shows that Saturating Additive Rewards improve the hard-tier solving rate by 2.3x, with an 8B model competitive with larger systems.

Self-Improving Language Models with Bidirectional Evolutionary Search

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.

Automating Formal Verification with Agent-Guided Tree Search

cs.LO · 2026-05-26 · unverdicted · novelty 6.0

Agent-directed tree search improves LLM performance on Lean formal verification tasks, with context-based orchestration solving more intermediate specs at lower token cost than baseline agents.

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.

An Information-Theoretic Criterion for Efficient Data Synthesis

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.

GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.

A Minimal Agent for Automated Theorem Proving

cs.AI · 2026-02-27 · unverdicted · novelty 6.0

A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.

FactorLibrary: From Polynomials to Circuits via Recursive Subgoals

cs.LG · 2026-06-24 · unverdicted · novelty 5.0

FactorLibrary stores reusable subexpressions to help RL agents (especially PPO+MCTS top-down) find certified optimal arithmetic circuits for polynomials up to complexity 8 at a 91.8% success rate.

From Meta Idea to Advanced Mathematical Discovery -- Human-AI Co-Discovery of Sign-Embedding Quantum Algorithms

cs.LG · 2026-06-12 · unverdicted · novelty 5.0

Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

cs.CL · 2026-06-03 · unverdicted · novelty 5.0

An agentic theorem prover in Lean uses a control plane to route actions based on cost and success estimates, achieving 28.9% lower average cost than a fixed-step baseline on a PutnamBench subset while preserving performance.

A Theoretical Framework for Self-Play Theorem Proving Algorithms

cs.LG · 2026-06-01 · unverdicted · novelty 5.0

Provides a graph model of theorems and proves exponential growth of proved theorems via random-walk conjecturing under connectivity, plus a diversity-maximizing conjecturer using diffusion similarity from contrastive embeddings.

Automating Formal Verification with Reinforcement Learning and Recursive Inference

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark, while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 previously unsolved tasks.

citing papers explorer

Showing 6 of 6 citing papers after filters.

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation · cs.LG · 2026-06-08 · unverdicted · none · ref 26
Presents the PyGeoX DSL and a 300-problem benchmark, identifies outlier gradient masking under global rewards, and shows that Saturating Additive Rewards improve the hard-tier solving rate by 2.3x, with an 8B model competitive with larger systems.
An Information-Theoretic Criterion for Efficient Data Synthesis · cs.LG · 2026-05-11 · unverdicted · none · ref 14
Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better generalization by converging to the most information-efficient component.
FactorLibrary: From Polynomials to Circuits via Recursive Subgoals · cs.LG · 2026-06-24 · unverdicted · none · ref 60
FactorLibrary stores reusable subexpressions to help RL agents (especially PPO+MCTS top-down) find certified optimal arithmetic circuits for polynomials up to complexity 8 at a 91.8% success rate.
From Meta Idea to Advanced Mathematical Discovery -- Human-AI Co-Discovery of Sign-Embedding Quantum Algorithms · cs.LG · 2026-06-12 · unverdicted · none · ref 5
Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.
A Theoretical Framework for Self-Play Theorem Proving Algorithms · cs.LG · 2026-06-01 · unverdicted · none · ref 37
Provides a graph model of theorems and proves exponential growth of proved theorems via random-walk conjecturing under connectivity, plus a diversity-maximizing conjecturer using diffusion similarity from contrastive embeddings.
Automating Formal Verification with Reinforcement Learning and Recursive Inference · cs.LG · 2026-05-29 · unverdicted · none · ref 30
RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark, while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 previously unsolved tasks.
