Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
hub Mixed citations
First proof
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
years
2026 15representative citing papers
This paper introduces a taxonomy of four LLM failure modes on research math proofs and empirically shows premise smuggling in all eight audited Gemini outputs, with a new audit instrument achieving 100% precision.
A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
Computational discovery via FlowBoost supports conjectures on the singular values of the coupling matrix E_n being 2^{-k/2} independent of n, a sharp p=2 critical exponent for p-Stam inequalities, and bifurcation of extremals for p<2.
Reports methodology and results from evaluating multiple AI systems on ten new research-level mathematics problems from active researchers.
Iteris, an agentic research system, produced evidence and drafts for two open computational math problems that were verified after human correction.
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
AI models discovered the worst-case complexity of the Kaczmarz algorithm for solving linear systems.
citing papers explorer
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
-
Failure Modes of Large Language Models on Research-Level Mathematics: A Taxonomy and an Empirical Characterisation
This paper introduces a taxonomy of four LLM failure modes on research math proofs and empirically shows premise smuggling in all eight audited Gemini outputs, with a new audit instrument achieving 100% precision.
-
Formalizing Mathematics at Scale
A multi-agent framework called AutoformBot autoformalized 26 textbooks spanning analysis, algebra, topology, combinatorics and probability into a verified Lean 4 library of 45k declarations, demonstrating scalable formalization of graduate math.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.
-
RMA: an Agentic System for Research-Level Mathematical Problems
RMA, a multi-agent system with structured memory and iterative feedback loops, solves 8 out of 10 research-level math problems on the new First Proof benchmark and outperforms GPT-5.2R and Aletheia according to expert evaluation.
-
Spectral Structure in Finite Free Information Inequalities and $p$-Stam Phase Transitions
Computational discovery via FlowBoost supports conjectures on the singular values of the coupling matrix E_n being 2^{-k/2} independent of n, a sharp p=2 critical exponent for p-Stam inequalities, and bifurcation of extremals for p<2.
-
First Proof Second Batch
Reports methodology and results from evaluating multiple AI systems on ten new research-level mathematics problems from active researchers.
-
Iteris: Agentic Research Loops for Computational Mathematics
Iteris, an agentic research system, produced evidence and drafts for two open computational math problems that were verified after human correction.
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
-
Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations
Forage V2 enables agent organizations to grow knowledge from 0 to 54 entries over runs and transfer it so weaker models nearly match stronger ones in coverage, cost, and speed on open-world tasks.
-
Artificial Intelligence and the Structure of Mathematics
AI agents exploring Platonic mathematical structures via proof hypergraphs may reveal the overall architecture of formal mathematics and what makes parts of it human-accessible.
-
How AI settled the complexity of the oldest SGD algorithm
AI models discovered the worst-case complexity of the Kaczmarz algorithm for solving linear systems.
- Automated Conjecture Resolution with Formal Verification