{"total":13,"items":[{"citing_arxiv_id":"2606.29593","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How AI settled the complexity of the oldest SGD algorithm","primary_cat":"cs.LG","submitted_at":"2026-06-28T20:27:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AI models discovered the worst-case complexity of the Kaczmarz algorithm for solving linear systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02484","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Iteris: Agentic Research Loops for Computational Mathematics","primary_cat":"cs.AI","submitted_at":"2026-06-01T16:54:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Iteris, an agentic research system, produced evidence and drafts for two open computational math problems that were verified after human correction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01462","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-05-31T21:46:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal 
patching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29955","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Formalizing Mathematics at Scale","primary_cat":"cs.AI","submitted_at":"2026-05-28T14:00:22+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22875","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RMA: an Agentic System for Research-Level Mathematical Problems","primary_cat":"cs.AI","submitted_at":"2026-05-20T04:54:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09063","ref_index":1,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-09T17:14:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges 
and no model exceeds 50% on refusal for ill-posed problems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"final answers and are frequently derived from competition materials [7, 20, 21] or curated into unified suites [13]. In contrast, research-level benchmarks aim to probe advanced mathematical knowledge and longer-horizon reasoning, drawing on research literature or researcher-authored questions, as in FrontierMath [15], RealMath [33], and more recently First Proof [1]. Dataset construction choices also interact strongly with contamination risk. A large fraction of benchmarks are assembled from publicly available exams and competitions or from published sources [19, 11, 13, 8, 33]. But items sourced from exams are vulnerable to overlap with training data, and contamination has been documented in widely used contest-derived sets [8]."},{"citing_arxiv_id":"2605.06651","ref_index":52,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI co-mathematician: Accelerating mathematicians with agentic AI","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:56:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20622","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"pAI/MSc: ML Theory Research with Humans on the Loop","primary_cat":"cs.AI","submitted_at":"2026-04-22T14:38:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a 
literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Claude's cycles. Informal note / PDF on Knuth's preprints page, February 2026. URL https://cs.stanford.edu/~knuth/papers/claude-cycles.pdf. Dated 2026-02-28; revised 2026-03-16. [8] Terence Tao. The story of Erdős problem #1026. Blog post on What's New, December 2025. URL https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/. Published 2025-12-08. [9] Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First proof. arXiv preprint arXiv:2602.05192, 2026. doi: 10.48550/arXiv.2602.05192. URL https://arxiv.org/abs/2602.05192. [10] First Proof Project. First batch. Project website, February 2026."},{"citing_arxiv_id":"2604.19837","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations","primary_cat":"cs.AI","submitted_at":"2026-04-21T08:14:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11922","ref_index":2,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spectral Structure in Finite Free Information Inequalities and $p$-Stam Phase 
Transitions","primary_cat":"math.PR","submitted_at":"2026-04-13T18:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Computational discovery via FlowBoost supports conjectures on the singular values of the coupling matrix E_n being 2^{-k/2} independent of n, a sharp p=2 critical exponent for p-Stam inequalities, and bifurcation of extremals for p<2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07240","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture","primary_cat":"cs.MS","submitted_at":"2026-04-08T16:06:43+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"feedback enabling rapid iteration; (ii) the k = 3 circle case is solved, providing a concrete calibration target; and (iii) empirically, agents can reach zero detected violations on k = 3 instances and substantially reduce violations on the hardest k = 4 instances beyond prior human-designed potentials in our evaluation suite. In this sense, our setup also contrasts with open-ended proof benchmarks such as First Proof [1] and Erdős Bench [16, 17], where automated feedback is typically sparser and iteration is less directly supported. Moreover, our task can be viewed as a valuable benchmark for agentic open-ended discovery. By providing a clear target (achieving zero violations), it supports diagnostic evaluation beyond incremental score improvements. 
Even in the resolved k = 3 regime, this criterion"},{"citing_arxiv_id":"2604.06107","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Artificial Intelligence and the Structure of Mathematics","primary_cat":"cs.AI","submitted_at":"2026-04-07T17:19:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"ematical reasoning, has seen many of its highest tier problems solved by AI [35]. AI agents have also been indispensable in completing formal proofs of the prime number theorem [56] and higher dimensional sphere packing [57], contributing hundreds of thousands of lines of Lean code and dramatically advancing Hilbert's dream of formalizing mathematics. First- Proof [2], a set of 10 novel research-grade problems that were held-out intermediate results in the work of expert mathematicians saw the majority of its problems solved autonomously [37, 65] within a week of release. 
Researchers in mathematics, computer science, and theoretical physics have found modern AI systems useful in the wild, for developing proofs and"},{"citing_arxiv_id":"2604.03789","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Automated Conjecture Resolution with Formal Verification","primary_cat":"cs.LG","submitted_at":"2026-04-04T16:35:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"in an autonomous manner [16], as well as a generalized Erdős problem in a semi-autonomous setting with human collaboration [5]. Beyond these, Aletheia has also addressed genuine research problems, including computing eigenweights for the Arithmetic Hirzebruch Proportionality Principle [14] and establishing bounds for independence sets [31]. Furthermore, Aletheia solved 6 out of 10 problems in the FirstProof benchmark [15], introduced in [1], which consists of real research problems solved by human mathematicians but not publicly released at the time of evaluation. In parallel, the FullProof system [7] demonstrates effective human-AI collaboration in algebraic geometry: the AI handles special cases, humans provide insights for the general case, and the AI subsequently completes the general solution based on these hints."}],"limit":50,"offset":0}