{"total":26,"items":[{"citing_arxiv_id":"2606.26442","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AXLE: A Cloud Infrastructure for Lean 4 Theorem Proving Utilities","primary_cat":"cs.LO","submitted_at":"2026-06-24T23:09:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AXLE is a multi-tenant cloud platform providing Lean 4 metaprogramming utilities with per-request isolation, multi-version support, and public access via SDK and API, having processed over 500 million requests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25394","ref_index":60,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FactorLibrary: From Polynomials to Circuits via Recursive Subgoals","primary_cat":"cs.LG","submitted_at":"2026-06-24T04:45:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FactorLibrary stores reusable subexpressions to help RL agents (especially PPO+MCTS top-down) find certified optimal arithmetic circuits for polynomials up to complexity 8 at 91.8% success rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24443","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Verifiable Auto-Formalization of Mathematics Using a Relaxed Natural Formal Language","primary_cat":"cs.LO","submitted_at":"2026-06-23T11:24:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Relaxed NFL intermediate language for LLM-based auto-formalization, with rule-plus-LLM elaboration to Core NFL and tactic-language discharge of verification 
conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22337","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Theorist Toolbox: Tools for Agent Based LLM-assisted economic theory Research","primary_cat":"econ.TH","submitted_at":"2026-06-21T05:03:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"External verification structures, not model capability, determine the reliability of LLM-assisted economic theory, as shown in attempts to design an incentive mechanism for a grade inflation model where adversarial checks caught false claims.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24899","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Meta Idea to Advanced Mathematical Discovery -- Human-AI Co-Discovery of Sign-Embedding Quantum Algorithms","primary_cat":"cs.LG","submitted_at":"2026-06-12T13:30:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09278","ref_index":26,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation","primary_cat":"cs.LG","submitted_at":"2026-06-08T09:44:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents PyGeoX DSL and 300-problem benchmark, identifies outlier 
gradient masking under global rewards, and shows Saturating Additive Rewards improve hard-tier solving rate by 2.3x with an 8B model competitive to larger systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05400","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization","primary_cat":"cs.AI","submitted_at":"2026-06-03T20:09:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04883","ref_index":55,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean","primary_cat":"cs.CL","submitted_at":"2026-06-03T13:46:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An agentic theorem prover in Lean uses a control plane to route actions based on cost and success estimates, achieving 28.9% lower average cost than a fixed-step baseline on a PutnamBench subset while preserving performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01861","ref_index":37,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Theoretical Framework for Self-Play Theorem Proving 
Algorithms","primary_cat":"cs.LG","submitted_at":"2026-06-01T08:12:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Provides a graph model of theorems and proves exponential growth of proved theorems via random-walk conjecturing under connectivity, plus a diversity-maximizing conjecturer using diffusion similarity from contrastive embeddings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00618","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Efficient Test-time Inference for Generative Planning Models with OCL Search","primary_cat":"cs.AI","submitted_at":"2026-05-30T08:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Modified OCL search integrates generative rollouts and learned heuristics for efficient inference in planning models across combinatorial domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30914","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Automating Formal Verification with Reinforcement Learning and Recursive Inference","primary_cat":"cs.LG","submitted_at":"2026-05-29T06:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29955","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Formalizing Mathematics at 
Scale","primary_cat":"cs.AI","submitted_at":"2026-05-28T14:00:22+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28814","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Self-Improving Language Models with Bidirectional Evolutionary Search","primary_cat":"cs.CL","submitted_at":"2026-05-27T17:59:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23772","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Agentic Proving for Program Verification","primary_cat":"cs.AI","submitted_at":"2026-05-22T15:41:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agentic Claude reaches 98.8% valid specs, 87.5% implementation certification, and 98.1% end-to-end success on CLEVER, revealing a mismatch between benchmark difficulty and current prover 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23643","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin","primary_cat":"cs.CR","submitted_at":"2026-05-22T13:55:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22885","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-21T02:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ImProver 2 combines a data-efficient expert-iteration pipeline with a neurosymbolic scaffold to train a 7B model that outperforms larger models in Lean 4 proof optimization across structural metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19338","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision","primary_cat":"cs.MA","submitted_at":"2026-05-19T04:20:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STAR-PólyaMath introduces a multi-agent framework with meta-strategic supervision and state-machine orchestration that reports state-of-the-art and perfect scores on eight top math competition 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13171","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics","primary_cat":"cs.AI","submitted_at":"2026-05-13T08:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10141","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T07:51:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16379","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"An Information-Theoretic Criterion for Efficient Data Synthesis","primary_cat":"cs.LG","submitted_at":"2026-05-11T01:27:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Synthetic data improves models only in information-open generation-training loops with external signals, and coarser signals like binary correctness enable better 
generalization by converging to the most information-efficient component.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09079","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators","primary_cat":"cs.AI","submitted_at":"2026-05-09T17:39:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge. 1 Introduction Causal reasoning, the ability to simulate and answer what-if scenarios, is a defining feature of human intelligence. Large Language Models (LLMs) continue to reinforce this claim: despite surpassing human performance across various domains including mathematics [1], coding [2, 3], and other knowledge-intensive tasks [4, 5], it is well established that LLMs struggle to causally reason [6, 7, 8, 9]. What makes causal reasoning hard? Consider the counterfactual query in Fig. 
1: How big would tumor X have been if drug Y was given? Answering this requires inferring latent factors, intervening on the drug, and propagating the effect through intermediate variables to obtain the final prediction."},{"citing_arxiv_id":"2605.06651","ref_index":23,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AI co-mathematician: Accelerating mathematicians with agentic AI","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21187","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery","primary_cat":"math.CO","submitted_at":"2026-04-23T01:05:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"partial","one_line_summary":"A SAT-plus-LLM method discovers infinite families of doubly saturated Ramsey-good graphs, answering Grinstead and Roberts' 1982 question.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"thanks to the combination of several technologies: - Mature and efficient symbolic reasoning: SAT solvers and computer algebra systems have evolved tremendously, and are currently able to reliably handle large computations. - LLMs capable of producing mathematical arguments: Frontier models have demonstrated remarkable mathematical capabilities [48], ranging from competition problems [29] to research-level questions and open problems [7,18,46]. 
- Autoformalization: Relying on proof assistants like Lean [36], with large mathematical libraries [13], LLMs can now formalize proofs for mathematical arguments of increasing complexity [44,45]. arXiv:2604.21187v1 [math.CO] 23 Apr 2026 2 Przybocki et al. In this paper, we give a glimpse of what this golden age looks like, and how these"},{"citing_arxiv_id":"2604.02721","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-03T04:26:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, and Wenda Zhou. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025. [7] Google DeepMind. Gemini 2.5: Our newest gemini model with thinking. https://deepmind.google/blog/gemini-2-5-our-most-intelligent-ai-model/, 2025. Technical blog post. [8] Thomas Hubert, Rishi Mehta, David Silver, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 2025. doi: 10.1038/s41586-025-09833-y. [9] IOI. International Olympiad in Informatics (IOI). https://www.ioinformatics.org/, 2026. Accessed 2026-03-25. 
[10] Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, and Chuan Wu."},{"citing_arxiv_id":"2604.02598","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Explorable Theorems: Making Written Theorems Explorable by Grounding Them in Formal Representations","primary_cat":"cs.HC","submitted_at":"2026-04-03T00:16:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Explorable theorems ground written proofs in Lean formalizations to enable step-by-step execution, custom example testing, and dependency tracing, with a user study showing improved comprehension.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"theorem and proof as context alone, and (2) prompting Gemini to explain how the proof applies to a specific value, given the theorem statement and proof as context. Participants were also free to ask Gemini any question of their choosing. This choice was motivated by two considerations. First, Gemini has demonstrated strong performance on general-purpose mathematical reasoning and zero-shot proof generation [21]. Second, using an AI chatbot as a baseline reflects a modern and increasingly tool used in math education [37, 43], allowing us to directly compare two ways to read proofs: interacting with a chatbot versus using the explorable theorems system. 
In our pilot tests, Gemini yielded correct answers, so any differences in comprehension outcomes are less likely attrib-"},{"citing_arxiv_id":"2602.24273","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Minimal Agent for Automated Theorem Proving","primary_cat":"cs.AI","submitted_at":"2026-02-27T18:43:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A minimal agentic system achieves competitive performance in automated theorem proving with a simpler design and lower cost than state-of-the-art methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}